Deploy a TrainingJob
To deploy a training job, use the `train push` command:
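For example, assuming the command is exposed through the truss CLI and your job is defined in a file named config.py (both assumptions for illustration):

```bash
truss train push config.py
```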
TrainingJob
Defines a complete training job configuration.
TrainingProject
Organizes training jobs and provides project-level configuration. For example, to create a project named my-team, you can set the following:
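A minimal sketch, assuming the classes live in a truss_train.definitions module; every field name here is an assumption, and the individual pieces are documented in the sections below:

```python
from truss_train import definitions

# Assumed field names, shown for illustration; see Image, Compute, and
# Runtime below for what each piece configures.
training_job = definitions.TrainingJob(
    image=definitions.Image(base_image="python:3.11-slim"),
    compute=definitions.Compute(),
    runtime=definitions.Runtime(start_commands=["python train.py"]),
)

project = definitions.TrainingProject(
    name="my-team",
    job=training_job,
)
```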
Image
Specifies the container image for the training environment. For example, to use the pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime base image, you can set the following:
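A minimal sketch, again assuming the truss_train.definitions module and a base_image field name (both assumptions):

```python
from truss_train import definitions

image = definitions.Image(
    base_image="pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime",
)
```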
DockerAuth
Configures authentication for private Docker registries. Ensure that any SecretReference used has been set in your Baseten Workspace. See secrets for more details.
AWSIAMDockerAuth
Authenticates with AWS ECR using IAM credentials.
GCPServiceAccountJSONDockerAuth
Authenticates with Google Container Registry using service account JSON.
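A hypothetical sketch of ECR authentication; the constructor arguments for these auth classes are not documented here, so every field name below is an assumption:

```python
from truss_train import definitions

docker_auth = definitions.AWSIAMDockerAuth(
    # Assumed field names; the referenced secrets must already exist
    # in your Baseten workspace.
    access_key_id=definitions.SecretReference(name="aws-access-key-id"),
    secret_access_key=definitions.SecretReference(name="aws-secret-access-key"),
)
```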
Compute
Specifies compute resources for training jobs.
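A minimal sketch; node_count, cpu_count, and memory are assumed field names for illustration:

```python
from truss_train import definitions

compute = definitions.Compute(
    node_count=1,    # assumed field name
    cpu_count=4,     # assumed field name
    memory="16Gi",   # assumed field name
)
```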
Runtime
Defines the runtime environment for training jobs. For example, to configure a job with a `python train.py` start command, a BATCH_SIZE environment variable set to 32, and a WANDB_API_KEY environment variable set to a secret reference named WANDB_KEY, you can set the following:
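A minimal sketch, assuming start_commands and environment_variables field names (assumptions), with SecretReference resolving the secret at runtime:

```python
from truss_train import definitions

runtime = definitions.Runtime(
    start_commands=["python train.py"],
    environment_variables={
        "BATCH_SIZE": "32",
        # Resolved from the WANDB_KEY secret stored in your workspace.
        "WANDB_API_KEY": definitions.SecretReference(name="WANDB_KEY"),
    },
)
```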
Training Cache
By default, the training cache provides two mount locations:
- `$BT_PROJECT_CACHE_DIR`, which is shared and accessible by all jobs within the same Training Project
- `$BT_TEAM_CACHE_DIR`, which is shared by jobs that belong to the same Team
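For instance, a training script can reuse the project cache across jobs to avoid re-downloading data; a sketch, where the my-dataset path is purely illustrative:

```python
import os
from pathlib import Path

# BT_PROJECT_CACHE_DIR is provided by Baseten inside the job's environment.
cache_dir = Path(os.environ["BT_PROJECT_CACHE_DIR"])

dataset_path = cache_dir / "my-dataset"  # illustrative location
if not dataset_path.exists():
    dataset_path.mkdir(parents=True)
    # ... download or generate the dataset into dataset_path here ...
```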
SecretReference
Used to securely reference secrets stored in your Baseten workspace.
CheckpointingConfig
Configures model checkpointing behavior during training. Baseten will export the `$BT_CHECKPOINT_DIR` environment variable within the Training Job’s environment.
The checkpointing storage is independent of the pod’s ephemeral storage.
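In your training code, write checkpoints into the exported directory; a sketch using plain PyTorch, where the model and filename are placeholders:

```python
import os

import torch
import torch.nn as nn

# BT_CHECKPOINT_DIR is exported by Baseten when checkpointing is configured.
checkpoint_dir = os.environ["BT_CHECKPOINT_DIR"]

model = nn.Linear(10, 2)  # placeholder model
torch.save(model.state_dict(), os.path.join(checkpoint_dir, "checkpoint_1.pt"))
```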
Baseten-provided environment variables
Baseten automatically provides several environment variables in your training job’s environment to help integrate your code with the Baseten platform.
Environment variables
| Environment Variable | Description | Example |
|---|---|---|
| BT_TRAINING_JOB_ID | ID of the Training Job | "gvpql31" |
| BT_TRAINING_PROJECT_ID | ID of the Training Project | "aghi527" |
| BT_NUM_GPUS | Number of available GPUs per node | "4" |
| BT_PROJECT_CACHE_DIR | Directory shared across Training Jobs within a single Training Project | "/root/.cache/user_artifacts" |
| BT_TEAM_CACHE_DIR | Directory shared across Training Jobs within a single Team | "/root/.cache/team_artifacts" |
| BT_CHECKPOINT_DIR | Directory where checkpoints are automatically saved during training | "/mnt/ckpts" |
| BT_LOAD_CHECKPOINT_DIR | Directory where loaded checkpoints are placed | "/tmp/loaded_checkpoints" |
| BT_TRAINING_JOB_NAME | Name of your Training Job | "gpt-oss-20b-lora" |
| BT_TRAINING_PROJECT_NAME | Name of your Training Project | "gpt-oss-finetunes" |
Multinode Environment Variables
The following environment variables are particularly useful for multinode training jobs:

| Environment Variable | Description | Example |
|---|---|---|
| BT_GROUP_SIZE | Number of nodes in the multinode deployment | "2" |
| BT_LEADER_ADDR | Address of the leader node | "10.0.0.1" |
| BT_NODE_RANK | Rank of the node | "0" |
Any open port (e.g., 29500) will work. There is no specific port number required by Baseten.
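For example, a PyTorch job can wire these variables into torchrun; an illustrative command where 29500 is an arbitrary port choice and train.py is a placeholder script:

```bash
torchrun \
  --nnodes="$BT_GROUP_SIZE" \
  --node_rank="$BT_NODE_RANK" \
  --master_addr="$BT_LEADER_ADDR" \
  --master_port=29500 \
  train.py
```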
Deploy Checkpoints as a Model
Deploy checkpoints CLI wizard
The easiest way to deploy your checkpoints is by using the CLI wizard:
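Assuming the subcommand is exposed through the truss CLI (an assumption based on the subcommand name):

```bash
truss train deploy_checkpoints
```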
Some options are not supported by `deploy_checkpoints` today and must be manually configured in the truss config.
Once you’ve completed the wizard, Baseten will generate a truss and deploy a published
model according to the specs provided.
Deploy checkpoints with static configuration
If you’d like to keep a static configuration for your checkpoint deployment, you can create a Python config file defining the configuration you’d like to reference, using the classes below:
DeployCheckpointsRuntime
Configures the runtime environment for deployed checkpoints.
Checkpoint
Represents metadata for a saved model checkpoint.
CheckpointList
Manages a collection of checkpoints and their download configuration.
DeployCheckpointsConfig
Specifies configuration for deploying trained model checkpoints. For example, to deploy checkpoint_1 from a training job with the id gvpql31, you can set the following:
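A minimal sketch; the nesting and field names (checkpoint_details, checkpoints, id, training_job_id) are assumptions for illustration:

```python
from truss_train import definitions

deploy_config = definitions.DeployCheckpointsConfig(
    checkpoint_details=definitions.CheckpointList(
        checkpoints=[
            definitions.Checkpoint(
                id="checkpoint_1",          # assumed field name
                training_job_id="gvpql31",  # assumed field name
            ),
        ],
    ),
)
```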