Baseten Training integrates seamlessly with Baseten’s model deployment capabilities. Once your TrainingJob has produced model checkpoints, you can deploy them as fully operational model endpoints. This feature works with HuggingFace-compatible LLMs, allowing you to deploy fine-tuned language models directly from your training checkpoints with a single command. To deploy checkpoints, first ensure you have a TrainingJob running with a checkpointing_config enabled:
runtime = definitions.Runtime(
    start_commands=[
        "/bin/sh -c './run.sh'",
    ],
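    # Enable checkpointing: files written to $BT_CHECKPOINT_DIR are uploaded to Baseten storage.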
    checkpointing_config=definitions.CheckpointingConfig(
        enabled=True,
    ),
)
In your training code or configuration, ensure that checkpoints are written to the checkpointing directory, which can be referenced via the $BT_CHECKPOINT_DIR environment variable. The contents of this directory are uploaded to Baseten’s storage and made immediately available for deployment. The default location is /tmp/training_checkpoints; you can optionally specify a checkpoint_path in your checkpointing_config if you prefer to write to a specific directory.
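For example, a HuggingFace Trainer-based script might point its output directory at $BT_CHECKPOINT_DIR (a minimal sketch; the save strategy and step count are illustrative assumptions):
import os
from transformers import TrainingArguments

# $BT_CHECKPOINT_DIR is set by Baseten when checkpointing is enabled;
# fall back to the documented default for local runs.
checkpoint_dir = os.environ.get("BT_CHECKPOINT_DIR", "/tmp/training_checkpoints")

training_args = TrainingArguments(
    output_dir=checkpoint_dir,  # Trainer writes checkpoint-<step>/ subdirectories here
    save_strategy="steps",      # illustrative: save a checkpoint every 500 steps
    save_steps=500,
)
To deploy your checkpoint(s) as a Deployment, you can: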
  • run truss train deploy_checkpoints [--job-id <job_id>] and follow the setup wizard.
  • define an instance of the DeployCheckpointsConfig class (helpful for customizations the wizard doesn’t offer) and run truss train deploy_checkpoints --config <path_to_config_file>.
Currently, the deploy_checkpoints command only supports LoRA and full fine-tune checkpoints from single-node LLM training jobs.
When deploy_checkpoints is run, Truss constructs a deployment config.yml and stores it on disk in a temporary directory. If you’d like to preserve or modify the resulting deployment config, copy it into a permanent directory and customize it as needed. This file is the source of truth for the deployment and can be deployed independently via truss push. See deployments for more details.

After a successful deployment, your model is live on Baseten, where you can run inference requests and evaluate performance. See Calling Your Model for more details.

To download the files you saved to the checkpointing directory, or to understand their file structure, run truss train get_checkpoint_urls [--job-id=<job_id>] to get a JSON file containing presigned URLs for each checkpoint file in the training job. The JSON file has the following structure:
{
  "timestamp": "2025-06-23T13:44:16.485905+00:00",
  "job": {
    "id": "03yv1l3",
    "created_at": "2025-06-18T14:30:30.480Z",
    "current_status": "TRAINING_JOB_COMPLETED",
    "error_message": null,
    "instance_type": {
			"id": "H100:2x8x176x968",
			"name": "H100:2x8x176x968 - 2 Nodes of 8 H100 GPUs, 640 GiB VRAM, 176 vCPUs, 968 GiB RAM",
			"memory_limit_mib": 967512,
			"millicpu_limit": 176000,
			"gpu_count": 8,
			"gpu_type": "H100",
			"gpu_memory_limit_mib": 655360
		},
    "updated_at": "2025-06-18T14:30:30.510Z",
    "training_project_id": "lqz9o34",
    "training_project": {
      "id": "lqz9o34",
      "name": "checkpointing"
    }
  },
  "checkpoint_artifacts": [
    {
      "url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-0/checkpoint-24/tokenizer_config.json?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056",
      "relative_file_name": "checkpoint-24/tokenizer_config.json",
      "node_rank": 0
    }
    ...
  ]
}
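For example, one way to pull these files locally (a sketch that assumes the JSON above was saved as checkpoints.json and that the requests package is installed):
import json
import pathlib

import requests

with open("checkpoints.json") as f:  # output of `truss train get_checkpoint_urls`
    manifest = json.load(f)

for artifact in manifest["checkpoint_artifacts"]:
    # Recreate the rank-<node_rank>/<relative_file_name> layout locally.
    dest = pathlib.Path(f"rank-{artifact['node_rank']}") / artifact["relative_file_name"]
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(requests.get(artifact["url"]).content)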
Important notes about the presigned URLs:
  • The presigned URLs expire 7 days after generation
  • These URLs are intended primarily for evaluation and testing, not for long-term inference deployments
  • For production deployments, consider copying the checkpoint files to your Truss model directory and downloading them in the model’s load() function, as sketched below
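As a rough illustration, a Truss model.py could load a fine-tuned HuggingFace-compatible checkpoint in load() (a minimal sketch; the job ID and checkpoint folder below are placeholders, and the AutoModel class assumes a causal LM):
# model/model.py -- sketch of the standard Truss model interface
from pathlib import Path

class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        # Placeholder path; mirrors <download_folder>/<training_job_id>/rank-<node_rank>/<checkpoint>
        ckpt = Path("/tmp/training_checkpoints/<your-training-job-id>/rank-0/checkpoint-24")
        self._tokenizer = AutoTokenizer.from_pretrained(ckpt)
        self._model = AutoModelForCausalLM.from_pretrained(ckpt)

    def predict(self, model_input):
        inputs = self._tokenizer(model_input["prompt"], return_tensors="pt")
        outputs = self._model.generate(**inputs, max_new_tokens=64)
        return {"completion": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}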

Complex and Custom Use Cases

  • Custom Model Architectures
  • Weights Sharded Across Nodes (contact Baseten for help implementing this)
Examine the structure of your files with truss train get_checkpoint_urls --job-id=<your-training-job-id>. If a file entry looks like this:
{
  "url": "https://bt-training-eqwnwwp-f815d6cd-19bf-4589-bfcb-da76cd8432c0.s3.amazonaws.com/training_projects/lqz9o34/jobs/03yv1l3/rank-4/checkpoint-10/weights.safetensors?AWSAccessKeyId=AKIARLZO4BEQO4Q2A5NH&Signature=0vdzJf0686wNE1d9bm4%2Bw9ik5lY%3D&Expires=1751291056",
  "relative_file_name": "checkpoint-10/weights.safetensors",
  "node_rank": 4
}
In your Truss configuration, add a section like the following. Wildcards (*) match an arbitrary number of characters, while ? matches exactly one:
training_checkpoints:
  download_folder: /tmp/training_checkpoints
  artifact_references:
    - training_job_id: <your-training-job-id>
      paths:
        - rank-*/checkpoint-10/ # Pull in all the files for checkpoint-10 across all nodes
When your model pod starts up, you can read files from the path /tmp/training_checkpoints/<training_job_id>/rank-<node_rank>/<relative_file_name>. For the example above, the file can be read from:
/tmp/training_checkpoints/<your-training-job-id>/rank-4/checkpoint-10/weights.safetensors
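From there, loading the shard is straightforward (a sketch assuming the safetensors package is installed; the path is the placeholder from above):
from safetensors.torch import load_file

# Placeholder path taken from the example above.
shard_path = "/tmp/training_checkpoints/<your-training-job-id>/rank-4/checkpoint-10/weights.safetensors"
state_dict = load_file(shard_path)  # dict mapping tensor names to torch.Tensors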