Installation
Truss includes the training SDK. Install it with one of:
- uv (recommended)
- pip (macOS/Linux)
- pip (Windows)
uv is a fast Python package manager. Create a virtual environment and install Truss:
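A sketch of that flow, assuming uv is already installed and that the truss package on PyPI includes the training SDK:

```shell
# create and activate a virtual environment in the project directory
uv venv
source .venv/bin/activate

# install Truss, which includes the training SDK
uv pip install --upgrade truss
```

With pip, the equivalent is creating a virtual environment with python -m venv .venv, activating it, and running pip install --upgrade truss.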
Create a config file (for example, config.py). All configuration classes in this reference are importable from truss_train (for example, from truss_train import Compute, Runtime). Import the SDK and accelerator config:

config.py
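A sketch of typical imports; the exact set of top-level exports is an assumption based on the classes documented below:

```python
# config.py — import the config classes used throughout this reference.
# Exact export locations may differ across truss versions.
from truss_train import (
    AcceleratorSpec,
    Compute,
    Image,
    Runtime,
    TrainingJob,
    TrainingProject,
)
```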
Complete example
Copy this config.py as a starting point for your training project. It configures caching to persist pip packages between jobs, checkpointing to save model weights, and GPU compute on a single H200 node. Modify the start_commands, environment_variables, and accelerator fields for your use case. For more examples, see ml-cookbook.
config.py
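A sketch of such a config. start_commands, environment_variables, node_count, and accelerator are documented below; the remaining keyword names are assumptions, so check them against your installed truss version:

```python
from truss_train import (
    AcceleratorSpec,
    CacheConfig,
    CheckpointingConfig,
    Compute,
    Image,
    Runtime,
    TrainingJob,
    TrainingProject,
)

job = TrainingJob(
    image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
    compute=Compute(
        node_count=1,  # single node
        accelerator=AcceleratorSpec(accelerator="H200", count=1),
    ),
    runtime=Runtime(
        start_commands=[
            "pip install -r requirements.txt",  # pip downloads persist via the cache
            "python train.py",
        ],
        environment_variables={"EPOCHS": "3"},
        cache_config=CacheConfig(enabled=True),                  # persist pip packages
        checkpointing_config=CheckpointingConfig(enabled=True),  # save model weights
    ),
)

project = TrainingProject(name="my-first-project", job=job)
```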
push
Submits a training job to Baseten. Every config you define with the classes below does nothing until you call push().
When you call push(), Baseten:
- Authenticates with your Baseten account.
- Creates the training project if one with the given name doesn’t already exist, or reuses the existing project.
- Archives your source directory (your training script, data files, and any other local files) and uploads it.
- Submits a new training job. Baseten provisions the hardware, pulls the container image, mounts any BDN weights, extracts your source files into the container, and runs your start_commands.
A job moves through these statuses:

| Status | Description |
|---|---|
| CREATED | Baseten has received the training configuration. |
| DEPLOYING | Baseten is provisioning compute resources and installing dependencies. |
| RUNNING | Your training code is actively executing. |
| COMPLETED | The job has finished. Checkpoints and artifacts have been saved. |
| DEPLOY_FAILED | The job failed to deploy, likely due to a bad image or resource allocation issue. |
| FAILED | The job encountered an error. Check the logs for details. |
| STOPPED | The job was manually stopped. |
truss train push config.py performs the same steps with additional options for team selection and flag overrides.
The push function accepts either a file path or a TrainingProject object.
config.py
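A minimal sketch of pushing by file path; the Python import location of push is an assumption (the CLI equivalent is truss train push config.py):

```python
from truss_train import push  # import location assumed

job = push("config.py")
print(job["id"])  # use the returned IDs to monitor the job
```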
Parameters
Path to a config.py file or a TrainingProject instance. When you pass a Path, Baseten imports the module and scans for an instance of TrainingProject. The module must contain exactly one.

Remote provider to push to. Defaults to baseten.

Root directory whose contents Baseten uploads as the job’s working directory. Baseten archives this directory and extracts it into the container before running start_commands. Only applies when config is a TrainingProject. Defaults to the current directory.

Return value
Returns a dictionary containing the created training job. Use the id and training_project.id values to monitor the job, stream logs, and list checkpoints.
Output
You can also pass a TrainingProject object to push():
submit_job.py
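A sketch of submit_job.py, assuming push is importable from truss_train and that your config.py exposes a module-level TrainingProject named project (both names are assumptions):

```python
from truss_train import push  # import location assumed

from config import project  # the TrainingProject defined in config.py

# source_dir controls which directory is archived and uploaded.
job = push(project, source_dir=".")
print(job["id"], job["training_project"]["id"])
```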
Output
After submitting
Once push() returns, Baseten queues your job and begins provisioning. Use the returned job ID to track progress:
- Stream logs: truss train logs --job-id <job_id> --tail
- Check status: truss train view --job-id <job_id>
- List checkpoints: use the get training job checkpoints API.
- Deploy a checkpoint: for more information, see deploy checkpoints.
For config.py-based submission with the CLI, see the training getting started guide.
TrainingProject
Groups related training jobs under a single named project. When you push a TrainingProject, Baseten creates the project if it doesn’t exist, then submits the attached TrainingJob. All jobs in a project share the same project-level cache and appear together in the dashboard.
config.py
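A minimal sketch; the keyword names follow the parameter descriptions below and are assumptions:

```python
from truss_train import TrainingJob, TrainingProject

project = TrainingProject(
    name="gpt-oss-finetunes",  # reusing a name adds jobs to the existing project
    job=TrainingJob(...),      # see TrainingJob
)
```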
Parameters
Project name. Reusing a name adds jobs to the existing project.
Training job to submit. Defines the container image, compute resources, runtime commands, and optional weights. For more information, see TrainingJob.
Team that owns this project. Controls access and team-level cache scope.
TrainingJob
Represents a single training run. Baseten provisions the hardware specified in Compute, pulls the container Image, uploads your source directory, mounts any WeightsSource volumes, then executes the Runtime start commands. For more information, see the training lifecycle.

config.py
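A sketch with the fields described below; the keyword names are assumptions based on the class names:

```python
from truss_train import Compute, Image, Runtime, TrainingJob

job = TrainingJob(
    image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
    compute=Compute(),  # defaults; see Compute
    runtime=Runtime(start_commands=["python train.py"]),
    name="baseline-run",  # display name in the dashboard; keyword name assumed
)
```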
Parameters
Docker image that provides the training environment, including the OS, CUDA drivers, and pre-installed libraries. For more information, see Image.
Hardware allocation for each node. Set the GPU type and count via accelerator, and increase node_count for distributed training. Defaults to Compute(). For more information, see Compute.

Controls container startup: shell commands to execute, environment variables to inject, and whether to enable caching or checkpointing. Defaults to Runtime(). For more information, see Runtime.

Display name for this job in the dashboard and API responses.
Opens an rSSH tunnel so you can attach VS Code or Cursor to the running container for live debugging. For more information, see InteractiveSession.
Controls which local files Baseten uploads to the container. Use this to exclude large directories, include files from outside the root, or change the root entirely. For more information, see Workspace.
Model weights that BDN mirrors and mounts read-only in the container. Supports Hugging Face, S3, GCS, Azure, R2, and direct URLs. For more information, see WeightsSource.
WeightsSource
Mounts pre-trained model weights into the training container as a read-only volume. Baseten mirrors the weights through BDN before provisioning compute, so the data is ready when your container starts. On subsequent jobs with the same source, BDN serves the cached copy, which avoids re-downloading.
- Hugging Face
- S3 with auth
- File filtering
config.py
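A sketch covering the Hugging Face case; the keyword names are assumptions:

```python
from truss_train import WeightsSource  # export location assumed

weights = WeightsSource(
    source="hf://meta-llama/Llama-3.1-8B@main",  # pin a revision with @branch/tag/SHA
    mount_path="/models/llama-3.1-8b",           # absolute mount path in the container
)
```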
Parameters
URI with scheme prefix.

| Scheme | Example | Description |
|---|---|---|
| hf:// | hf://meta-llama/Llama-3.1-8B@main | Hugging Face Hub. |
| s3:// | s3://my-bucket/path/to/data | Amazon S3. |
| gs:// | gs://my-bucket/path/to/data | Google Cloud Storage. |
| azure:// | azure://account/container/path | Azure Blob Storage. |
| r2:// | r2://account_id.bucket/path | Cloudflare R2. |
| https:// | https://example.com/model.bin | Direct URL download. |

For Hugging Face sources, pin to a specific revision with the @revision suffix (branch, tag, or commit SHA).

Absolute path where Baseten mounts the weights in the container.
Authentication configuration. See the BDN configuration reference.
Baseten secret name for credentials.
File patterns to include during download.
File patterns to exclude during download.
Image
Sets the Docker image that Baseten pulls to create the training container. The image provides the OS, CUDA drivers, Python version, and any pre-installed libraries your training code needs. Use a public image from Docker Hub or a private image with DockerAuth.

config.py
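A minimal sketch, assuming the image tag field is named base_image:

```python
from truss_train import Image

image = Image(
    base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime",
)
```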
Parameters
Full Docker image tag, such as "pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime".

Credentials for pulling from private registries like AWS ECR or Google Container Registry. Store actual credentials as Baseten secrets. For more information, see DockerAuth.
DockerAuth
Provides credentials for pulling images from private Docker registries (AWS ECR, Google Container Registry, etc.). Store the actual credential values as secrets in your Baseten workspace and reference them with SecretReference.

Authentication method.
Docker registry URL.
IAM credentials for authenticating with AWS ECR. Requires access_key_secret_ref and secret_access_key_secret_ref. For more information, see AWSIAMDockerAuth.

Service account JSON credentials for authenticating with Google Container Registry. For more information, see GCPServiceAccountJSONDockerAuth.
Username/password credentials for authenticating with registries that support static credentials (Docker Hub, GHCR, NGC). Not compatible with AWS ECR or GCP Artifact Registry. For more information, see RegistrySecretDockerAuth.
AWSIAMDockerAuth
Authenticates with AWS ECR using IAM credentials.

config.py
AWS access key ID, stored as a Baseten secret and referenced by name.
AWS secret access key, stored as a Baseten secret and referenced by name.
GCPServiceAccountJSONDockerAuth
Authenticates with Google Container Registry using service account JSON.

config.py
GCP service account JSON, stored as a Baseten secret and referenced by name.
RegistrySecretDockerAuth
Authenticates with registries that support static username/password credentials, including Docker Hub, GHCR, and NGC. For AWS ECR or GCP Artifact Registry, use AWSIAMDockerAuth or GCPServiceAccountJSONDockerAuth instead.

config.py
Registry credentials in username:password format (plaintext, not Base64-encoded), stored as a Baseten secret and referenced by name.

Compute
Defines the hardware Baseten allocates for each training job. Set node_count above 1 for multi-node distributed training, which provisions multiple identical nodes and injects coordination environment variables (BT_LEADER_ADDR, BT_NODE_RANK, BT_GROUP_SIZE).
config.py
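A sketch of a two-node H200 allocation; node_count and accelerator are documented, the AcceleratorSpec keyword names are assumptions:

```python
from truss_train import AcceleratorSpec, Compute

compute = Compute(
    node_count=2,  # provisions two identical nodes and injects BT_* coordination vars
    accelerator=AcceleratorSpec(accelerator="H200", count=8),  # 8 GPUs per node
)
```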
Parameters
Number of nodes to provision. Each node gets the full CPU, memory, and GPU allocation.
CPU cores per node.
RAM per node (for example, "64Gi"). Defaults to 2Gi.

GPU type and count per node. For more information, see AcceleratorSpec.
AcceleratorSpec
Selects the GPU type and count per node. The count determines how many GPUs are available to your training script on each node (exposed as $BT_NUM_GPUS).
GPU type. Available options:
- A10G: NVIDIA A10G.
- H200: NVIDIA H200.
Number of GPUs per node.
Runtime
Controls what happens when the training container starts. Baseten executes start_commands in order inside the container. Use them to install dependencies, set up data, and launch your training script. Baseten injects environment variables before the first command runs; use SecretReference for sensitive values like API keys so they aren’t stored in your config file.
config.py
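A sketch; start_commands, environment_variables, and cache_config are documented here, while the checkpointing keyword name and the secret name are assumptions:

```python
from truss_train import CacheConfig, CheckpointingConfig, Runtime, SecretReference

runtime = Runtime(
    start_commands=[
        "pip install -r requirements.txt",
        "python train.py --output-dir $BT_CHECKPOINT_DIR",
    ],
    environment_variables={
        "HF_TOKEN": SecretReference(name="hf_access_token"),  # secret name assumed
        "WANDB_MODE": "offline",
    },
    cache_config=CacheConfig(enabled=True),
    checkpointing_config=CheckpointingConfig(enabled=True),  # keyword name assumed
)
```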
Parameters
Shell commands that Baseten executes sequentially when the container starts.
Key-value pairs that Baseten injects as env vars. Use SecretReference for sensitive values.
Enables writing model checkpoints to persistent storage. When enabled, Baseten mounts a volume and exports $BT_CHECKPOINT_DIR. Defaults to CheckpointingConfig(). For more information, see CheckpointingConfig.

Enables a persistent read-write cache that survives across jobs for pip packages, model downloads, and preprocessed datasets. For more information, see CacheConfig.

Downloads checkpoints from a previous job into the container before start_commands run. Use this to resume training or initialize weights from an earlier experiment. For more information, see LoadCheckpointConfig.

Use cache_config with enabled=True instead.

SecretReference
Injects a secret stored in your Baseten workspace as an environment variable at runtime. Baseten never writes the value to your config file or source code. Use this for API keys, tokens, and credentials.

config.py
Name of the secret as it appears in your workspace settings.
CheckpointingConfig
Enables persistent checkpoint storage for the training job. When enabled is true, Baseten mounts a persistent volume and exports $BT_CHECKPOINT_DIR as an environment variable pointing to it. Your training script writes model weights, optimizer state, or any artifacts to that directory. These checkpoints survive job termination and can be deployed to inference or loaded into future jobs. See the checkpointing guide for best practices.
config.py
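A minimal sketch using the documented enabled flag:

```python
from truss_train import CheckpointingConfig

checkpointing = CheckpointingConfig(
    enabled=True,  # mounts a persistent volume and exports $BT_CHECKPOINT_DIR
)
```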
Set to true to mount a persistent checkpoint volume.

Override the default checkpoint directory path.
Size of the checkpoint volume in GiB. Defaults to a platform-managed size.
CacheConfig
Enables a persistent read-write cache that survives across jobs. Use the cache for pip packages, downloaded model weights, preprocessed datasets, or any data you don’t want to re-download on every run. When enabled is true, Baseten mounts two shared directories into the container. When require_cache_affinity is true (the default), Baseten schedules the job on a node that already has cached data, which avoids cold starts. See the cache guide for usage patterns.
config.py
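A minimal sketch using the documented flags:

```python
from truss_train import CacheConfig

cache = CacheConfig(
    enabled=True,                 # mounts $BT_PROJECT_CACHE_DIR and $BT_TEAM_CACHE_DIR
    require_cache_affinity=True,  # prefer nodes that already hold cached data (default)
)
```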
| Environment variable | Description |
|---|---|
| $BT_PROJECT_CACHE_DIR | Shared across all jobs in the same TrainingProject. Use for project-specific datasets or compiled artifacts. |
| $BT_TEAM_CACHE_DIR | Shared across all jobs in the same team. Use for common model weights or shared libraries. |
Set to true to mount persistent cache volumes.

Mount the Hugging Face cache at the legacy path for backward compatibility.
Schedule the job on a node with existing cached data when possible.
Base path where Baseten mounts cache directories. Defaults to /root/.cache.

LoadCheckpointConfig
Downloads checkpoints from previous training jobs into the container before start_commands run. Use this to resume training from a saved state or to initialize weights from an earlier experiment. Baseten downloads the specified checkpoints to download_folder (also exported as $BT_LOAD_CHECKPOINT_DIR) and your training script reads them at startup. For more information, see the loading checkpoints walkthrough.
config.py
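A sketch; enabled and download_folder are documented below, while the checkpoint-list keyword name is an assumption:

```python
from truss_train import BasetenCheckpoint, LoadCheckpointConfig

load_config = LoadCheckpointConfig(
    enabled=True,
    checkpoints=[  # keyword name assumed
        BasetenCheckpoint.from_latest_checkpoint(job_id="gvpql31"),
    ],
    download_folder="/tmp/loaded_checkpoints",  # exported as $BT_LOAD_CHECKPOINT_DIR
)
```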
Set to true to download checkpoints before start_commands run.

One or more checkpoint references to download. Create references with BasetenCheckpoint.from_latest_checkpoint() or BasetenCheckpoint.from_named_checkpoint(). For more information, see BasetenCheckpoint.

Directory where Baseten downloads checkpoints. Exported as $BT_LOAD_CHECKPOINT_DIR. Defaults to /tmp/loaded_checkpoints.

BasetenCheckpoint
Creates references to checkpoints saved by previous training jobs. Pass these references to LoadCheckpointConfig to download checkpoint data into your container at job start. You can reference checkpoints by project name (gets the most recent), by job ID (gets the most recent from that job), or by exact checkpoint name and job ID.

config.py
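A sketch of both constructors; project_name and job_id are documented, the from_named_checkpoint keyword names are assumptions:

```python
from truss_train import BasetenCheckpoint

# Most recent checkpoint anywhere in a project:
latest = BasetenCheckpoint.from_latest_checkpoint(project_name="gpt-oss-finetunes")

# A specific checkpoint from a specific job (keyword names assumed):
named = BasetenCheckpoint.from_named_checkpoint(
    checkpoint_name="checkpoint-500",  # hypothetical checkpoint name
    job_id="gvpql31",
)
```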
from_latest_checkpoint
Returns a reference to the most recent checkpoint from a project or job. At least one of project_name or job_id is required.
Project name to get the latest checkpoint from.
Job ID to get the latest checkpoint from.
from_named_checkpoint
Returns a reference to a specific checkpoint by its name and job ID.

Checkpoint name.
Job ID.
Workspace
Controls which local files Baseten uploads to the training container. By default, Baseten archives the directory containing your config.py (or the source_dir you pass to push) and extracts it into the container’s working directory. Use Workspace to customize this behavior: exclude large data directories, include files from outside the root, or change the root entirely.
config.py
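A sketch; workspace_root appears in the parameters below, the exclusion keyword name is an assumption:

```python
from truss_train import Workspace  # export location assumed

workspace = Workspace(
    workspace_root=".",  # defaults to the config file's parent directory
    exclude_dirs=["data", ".git", "__pycache__"],  # keyword name assumed
)
```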
Parameters
Override the root directory to archive. Defaults to the config file’s parent directory.
Additional directories outside workspace_root to include in the upload.

Directories to exclude from the upload (for example, "data", ".git", "__pycache__").

InteractiveSession
Opens an rSSH tunnel to the training container so you can attach VS Code or Cursor for live debugging. The tunnel stays active for timeout_minutes, then closes automatically. Use trigger to control when the session starts: immediately on job start, only when training fails, or on-demand from the dashboard. See the interactive sessions guide for setup details.
config.py
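A sketch using the documented trigger and timeout_minutes fields (the string form of the trigger value is an assumption; it may be an enum):

```python
from truss_train import InteractiveSession  # export location assumed

session = InteractiveSession(
    trigger="ON_FAILURE",  # open the tunnel only when training fails
    timeout_minutes=120,   # close the tunnel after two hours
)
```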
Parameters
Controls when to activate the session. Defaults to ON_DEMAND. Available options:
- ON_STARTUP: active from job start.
- ON_FAILURE: activates when training exits with a non-zero code.
- ON_DEMAND: activates when you change the trigger on a running job.

Minutes before the session expires. Set to -1 to extend the expiry to 10 years.

IDE for the remote tunnel. Defaults to VS_CODE. Available options:
- VS_CODE: VS Code Remote Tunnels.
- CURSOR: Cursor Remote Tunnels.

Authentication provider for the device code flow. Defaults to MICROSOFT. Available options:
- GITHUB: authenticate via GitHub.
- MICROSOFT: authenticate via Microsoft.
Environment variables
Baseten automatically injects these environment variables into every training container. Your training script can read them to discover job metadata, locate checkpoint and cache directories, and coordinate across nodes in multi-node jobs.

Standard variables
| Variable | Description | Example |
|---|---|---|
| BT_TRAINING_JOB_ID | Training job ID. | "gvpql31" |
| BT_TRAINING_PROJECT_ID | Training project ID. | "aghi527" |
| BT_TRAINING_JOB_NAME | Training job name. | "gpt-oss-20b-lora" |
| BT_TRAINING_PROJECT_NAME | Training project name. | "gpt-oss-finetunes" |
| BT_NUM_GPUS | Number of GPUs per node. | "4" |
| BT_CHECKPOINT_DIR | Checkpoint save directory. | "/mnt/ckpts" |
| BT_LOAD_CHECKPOINT_DIR | Loaded checkpoints directory. | "/tmp/loaded_checkpoints" |
| BT_PROJECT_CACHE_DIR | Project-level cache directory. | "/root/.cache/user_artifacts" |
| BT_TEAM_CACHE_DIR | Team-level cache directory. | "/root/.cache/team_artifacts" |
| BT_RW_CACHE_DIR | Base read-write cache directory. | "/root/.cache" |
| BT_RETRY_COUNT | Job retry attempt count. | "0" |
Multi-node variables
For distributed training across multiple nodes:

| Variable | Description | Example |
|---|---|---|
| BT_GROUP_SIZE | Number of nodes in deployment. | "2" |
| BT_LEADER_ADDR | Leader node address. | "10.0.0.1" |
| BT_NODE_RANK | Node rank (0 for leader). | "0" |
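A sketch of reading the coordination variables in a training script; the fallback values are for local development only and are not values Baseten injects:

```python
import os

# Multi-node coordination variables injected by Baseten; fall back to
# single-node defaults so the script also runs locally.
node_rank = int(os.environ.get("BT_NODE_RANK", "0"))
group_size = int(os.environ.get("BT_GROUP_SIZE", "1"))
leader_addr = os.environ.get("BT_LEADER_ADDR", "127.0.0.1")

is_leader = node_rank == 0
print(f"node rank {node_rank} of {group_size}, leader at {leader_addr}")
```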
Deploy checkpoints
Deploys trained model checkpoints from a completed training job to Baseten’s inference platform. Baseten downloads the checkpoint weights, packages them with a serving runtime, and creates a deployable model endpoint. See the deployment guide for the full workflow.

Deploy with CLI wizard
Deploy checkpoints interactively with the CLI wizard:

The deploy_checkpoints command doesn’t support FSDP checkpoints. Configure these manually in the Truss config. For optimized inference with TensorRT-LLM, see Deploy checkpoints with Engine Builder.
Deploy with static configuration
Create a Python config file for repeatable deployments:

DeployCheckpointsConfig
Defines how to deploy checkpoints from a completed training job to a Baseten inference endpoint. Baseten reads the checkpoint weights, selects the correct serving backend based on the model weights format (full, LoRA, or Whisper), and provisions the specified Compute resources.

deploy_config.py
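A sketch of deploy_config.py for a LoRA deployment; base_model_id is documented below, while the remaining keyword names are assumptions:

```python
from truss_train import (  # export locations assumed
    CheckpointList,
    Compute,
    DeployCheckpointsConfig,
    LoRACheckpoint,
)

deploy = DeployCheckpointsConfig(
    model_name="llama-3.1-8b-lora",     # keyword name assumed
    checkpoint_details=CheckpointList(  # keyword name assumed
        base_model_id="meta-llama/Llama-3.1-8B",
        checkpoints=[
            LoRACheckpoint(
                training_job_id="gvpql31",         # keyword name assumed
                checkpoint_name="checkpoint-500",  # hypothetical name
            ),
        ],
    ),
    compute=Compute(),  # same Compute configuration as training jobs
)
```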
Parameters
Checkpoints to deploy, including the base model ID for LoRA and one or more checkpoint references. For more information, see CheckpointList.
Name for the deployed model in the Baseten dashboard.
Environment variables for the inference runtime, such as API keys or serving configuration. For more information, see DeployCheckpointsRuntime.
GPU and memory allocation for the inference endpoint. Uses the same Compute configuration as training jobs.
DeployCheckpointsRuntime
Sets environment variables for the deployed inference endpoint. Use this to inject API keys or configuration that the serving runtime needs.

Key-value pairs that Baseten injects as env vars. Use SecretReference for sensitive values.
CheckpointList
Groups one or more checkpoints for deployment. For LoRA deployments, set base_model_id to the Hugging Face model ID you trained the adapters on.
Directory where Baseten downloads checkpoint files during deployment. Defaults to /tmp/training_checkpoints.

Hugging Face model ID for the base model. Required for LoRA deployments.
One or more FullCheckpoint, LoRACheckpoint, or WhisperCheckpoint instances.
Checkpoint types
Baseten supports three checkpoint types. Use the type that matches how your model was trained.

FullCheckpoint
Deploys a complete set of model weights from a full fine-tune.

Training job ID.
Checkpoint name.
Auto-set to full.

LoRACheckpoint
Deploys LoRA adapter weights on top of the base model you specify in CheckpointList.

Training job ID.
Checkpoint name.
Auto-set to lora.

LoRA adapter configuration. Set rank to match the rank you used during training. Defaults to LoRADetails(). Valid values: 8, 16, 32, 64, 128, 256, 320, 512.
WhisperCheckpoint
Deploys fine-tuned Whisper model weights for speech-to-text inference.

Training job ID.
Checkpoint name.
Auto-set to whisper.

LoRADetails
Sets the LoRA rank for adapter deployment. The rank must match the rank you set during training.

LoRA rank. Valid values: 8, 16, 32, 64, 128, 256, 320, 512.