Deploy a TrainingJob
To deploy a training job, push your training configuration with the `truss train push` command.
TrainingJob
Defines a complete training job configuration.

TrainingProject
Organizes training jobs and provides project-level configuration.

Image
Specifies the container image for the training environment.

DockerAuth
Configures authentication for private Docker registries. Ensure that any SecretReference used has been set in your Baseten workspace. See secrets for more details.

AWSIAMDockerAuth
Authenticates with AWS ECR using IAM credentials.

GCPServiceAccountJSONDockerAuth
Authenticates with Google Container Registry using service account JSON.

Example
Example usage with GCP:

Compute
Specifies compute resources for training jobs.

Runtime
Defines the runtime environment for training jobs.

Training Cache
By default, the training cache provides two mount locations:
- $BT_PROJECT_CACHE_DIR, which is shared and accessible by all jobs within the same Training Project
- $BT_TEAM_CACHE_DIR, which is shared by jobs that belong to the same Team
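The cache mounts above are exposed as environment variables, so training code can share artifacts across jobs. A minimal sketch; the `datasets/train.jsonl` layout and the local `.cache` fallback are illustrative assumptions, not part of the platform:

```python
import os
from pathlib import Path

def project_cache_path(relative: str) -> Path:
    """Resolve a path under the per-project shared cache.

    Inside a training job, BT_PROJECT_CACHE_DIR points at storage shared by
    all jobs in the same Training Project; the ".cache" fallback is only for
    running outside Baseten.
    """
    return Path(os.environ.get("BT_PROJECT_CACHE_DIR", ".cache")) / relative

# Download a dataset only if no job in the project has cached it yet
# ("datasets/train.jsonl" is a hypothetical layout).
dataset = project_cache_path("datasets/train.jsonl")
if not dataset.exists():
    dataset.parent.mkdir(parents=True, exist_ok=True)
    # ... fetch the dataset into `dataset` here ...
```

Because the cache is shared, concurrent jobs may race on the same path; writing to a temporary file and renaming is a reasonable precaution.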
SecretReference
Used to securely reference secrets stored in your Baseten workspace.

CheckpointingConfig
Configures model checkpointing behavior during training. Baseten will export $BT_CHECKPOINT_DIR within the Training Job's environment. The checkpointing storage is independent of the ephemeral storage of the pod.
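In practice, this means checkpoint writes should target that directory rather than pod-local paths. A small sketch; the `/tmp/checkpoints` fallback and the trainer usage in the comment are assumptions, not documented behavior:

```python
import os
from pathlib import Path

def checkpoint_dir() -> Path:
    """Directory Baseten exports for checkpoints inside a training job.

    Files written here persist independently of the pod's ephemeral storage;
    the "/tmp/checkpoints" fallback is only for running outside Baseten.
    """
    return Path(os.environ.get("BT_CHECKPOINT_DIR", "/tmp/checkpoints"))

# e.g. point a framework's output directory at it (assumed, framework-specific usage):
# TrainingArguments(output_dir=str(checkpoint_dir()), save_strategy="steps", ...)
```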
Baseten Provided Environment Variables
Baseten automatically provides several environment variables in your training job's environment to help integrate your code with the Baseten platform.

Environment Variables
| Environment Variable | Description | Example |
|---|---|---|
| BT_TRAINING_JOB_ID | ID of the Training Job | "gvpql31" |
| BT_TRAINING_PROJECT_ID | ID of the Training Project | "aghi527" |
| BT_NUM_GPUS | Number of available GPUs per node | "4" |
| BT_PROJECT_CACHE_DIR | Directory shared across Training Jobs within a single Training Project | "/root/.cache/user_artifacts" |
| BT_TEAM_CACHE_DIR | Directory shared across Training Jobs within a single Team | "/root/.cache/team_artifacts" |
| BT_CHECKPOINT_DIR | Directory where checkpoints are automatically saved during training | "/mnt/ckpts" |
| BT_LOAD_CHECKPOINT_DIR | Directory where loaded checkpoints are placed | "/tmp/loaded_checkpoints" |
| BT_TRAINING_JOB_NAME | Name of the Training Job | "gpt-oss-20b-lora" |
| BT_TRAINING_PROJECT_NAME | Name of the Training Project | "gpt-oss-finetunes" |
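One common use of these identifiers is tagging experiment-tracker runs so they can be traced back to a specific job. A hedged sketch; the run-name scheme and the `local-run` fallback are my own conventions, not Baseten's:

```python
import os

def run_name() -> str:
    """Build an experiment-tracker run name from Baseten-provided variables.

    Outside a training job these variables are unset, so we fall back to
    a generic "local-run" label.
    """
    parts = [
        os.environ.get("BT_TRAINING_PROJECT_NAME"),
        os.environ.get("BT_TRAINING_JOB_NAME"),
        os.environ.get("BT_TRAINING_JOB_ID"),
    ]
    return "-".join(p for p in parts if p) or "local-run"
```

With the example values above this would produce `gpt-oss-finetunes-gpt-oss-20b-lora-gvpql31`.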
Multinode Environment Variables
The following environment variables are particularly useful for multinode training jobs:

| Environment Variable | Description | Example |
|---|---|---|
| BT_GROUP_SIZE | Number of nodes in the multinode deployment | "2" |
| BT_LEADER_ADDR | Address of the leader node | "10.0.0.1" |
| BT_NODE_RANK | Rank of the node | "0" |
Any free port (e.g., 29500) will work for inter-node communication. There is no specific port number required by Baseten.
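These variables map naturally onto a `torchrun` launch. A sketch assuming a hypothetical `train.py` entrypoint; the single-node fallbacks are mine, and 29500 is just a conventional choice:

```python
import os

def torchrun_args(script: str = "train.py") -> list[str]:
    """Assemble a torchrun invocation from Baseten's multinode variables.

    The fallbacks let the same command work for single-node runs, where
    these variables may be unset.
    """
    return [
        "torchrun",
        f"--nnodes={os.environ.get('BT_GROUP_SIZE', '1')}",
        f"--node_rank={os.environ.get('BT_NODE_RANK', '0')}",
        f"--nproc_per_node={os.environ.get('BT_NUM_GPUS', '1')}",
        f"--master_addr={os.environ.get('BT_LEADER_ADDR', '127.0.0.1')}",
        "--master_port=29500",  # any free port works; Baseten mandates none
        script,
    ]
```

The same values could equally be passed to any other launcher (e.g., MPI or a custom rendezvous) that needs a node count, rank, and leader address.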
Deploy Checkpoints as a Model
Deploy checkpoints CLI wizard
The easiest way to deploy your checkpoints is by using the `truss train deploy_checkpoints` CLI wizard. Some options are not supported by deploy_checkpoints today and must be manually configured in the truss config.
Once you’ve completed the wizard, Baseten will generate a truss and deploy a published model according to the specs provided.