Deploy a TrainingJob
To deploy a training job, use the following command:
TrainingJob
Defines a complete training job configuration.
TrainingProject
Organizes training jobs and provides project-level configuration.
Image
Specifies the container image for the training environment.
DockerAuth
Configures authentication for private Docker registries. Ensure that any SecretReference used has been set in your Baseten Workspace. See secrets for more details.
AWSIAMDockerAuth
Authenticates with AWS ECR using IAM credentials.
GCPServiceAccountJSONDockerAuth
Authenticates with Google Container Registry using service account JSON.
Example
Example usage with GCP:
Compute
Specifies compute resources for training jobs.
Runtime
Defines the runtime environment for training jobs.
Training Cache
When enable_cache=True is set in your Runtime, the training cache is enabled.
The cache is mounted at two locations:
- /root/.cache/huggingface
- $BT_RW_CACHE_DIR (Baseten exports this variable in your job’s environment)
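As an illustration of how a training script might use the second mount, the sketch below resolves a subdirectory under `$BT_RW_CACHE_DIR` for reusable artifacts such as preprocessed datasets. The helper name `resolve_cache_dir`, the `LOCAL_FALLBACK` path, and the `tokenized` subdirectory are hypothetical and not part of the Baseten SDK. The fallback is only there so the same script can run outside a Baseten job.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical fallback used only when the script runs outside a Baseten job,
# where BT_RW_CACHE_DIR is not set.
LOCAL_FALLBACK = "/tmp/bt_rw_cache"


def resolve_cache_dir(subdir: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return (and create) a directory under $BT_RW_CACHE_DIR for cached artifacts."""
    env = os.environ if env is None else env
    root = Path(env.get("BT_RW_CACHE_DIR", LOCAL_FALLBACK))
    target = root / subdir
    # Create the directory if it does not exist yet; reuse it otherwise.
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    # "tokenized" is an illustrative subdirectory name.
    print(resolve_cache_dir("tokenized"))
```

Hugging Face libraries that use the default cache location would typically pick up the `/root/.cache/huggingface` mount without extra code, so only non-Hugging Face artifacts need a helper like this.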
SecretReference
Used to securely reference secrets stored in your Baseten workspace.
CheckpointingConfig
Configures model checkpointing behavior during training. Baseten exports $BT_CHECKPOINT_DIR within the training job’s environment. The checkpointing storage is independent of the pod’s ephemeral storage.
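A minimal sketch of how training code could target the checkpoint mount is shown below. It writes a small JSON state file under `$BT_CHECKPOINT_DIR`. The function name `save_checkpoint`, the `checkpoint-<step>` directory layout, the `trainer_state.json` file name, and the local fallback path are all assumptions made for illustration. A real job would usually write model weights through its training framework instead.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def save_checkpoint(step: int, state: dict,
                    env: Optional[Mapping[str, str]] = None) -> Path:
    """Write a small JSON state file under $BT_CHECKPOINT_DIR/checkpoint-<step>/."""
    env = os.environ if env is None else env
    # "./checkpoints" is a hypothetical fallback for runs outside Baseten.
    root = Path(env.get("BT_CHECKPOINT_DIR", "./checkpoints"))
    ckpt_dir = root / f"checkpoint-{step}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / "trainer_state.json"
    path.write_text(json.dumps({"step": step, **state}))
    return path
```

Because the checkpoint storage is separate from the pod's ephemeral storage, files written outside this directory should not be assumed to be handled by the checkpointing mount.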
Baseten Provided Environment Variables
Baseten automatically provides several environment variables in your training job’s environment to help integrate your code with the Baseten platform.
Environment Variables
Environment Variable | Description | Example |
---|---|---|
BT_TRAINING_JOB_ID | ID of the Training Job | "gvpql31" |
BT_NUM_GPUS | Number of available GPUs per node | "4" |
BT_RW_CACHE_DIR | Non-HuggingFace cache directory of the training cache mount | "/root/.cache/user_artifacts" |
BT_CHECKPOINT_DIR | Directory of the automated checkpointing mount | "/mnt/ckpts" |
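One way to consume these variables is to read them once at startup into a typed object, as in the sketch below. The `BasetenJobEnv` dataclass and the `load_job_env` helper are hypothetical names, not part of the Baseten SDK. The defaults for unset variables are also assumptions, chosen so the script still runs outside a training job.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BasetenJobEnv:
    job_id: Optional[str]
    num_gpus: int
    rw_cache_dir: Optional[str]
    checkpoint_dir: Optional[str]


def load_job_env(env: Optional[Mapping[str, str]] = None) -> BasetenJobEnv:
    """Collect the Baseten-provided variables from the environment."""
    env = os.environ if env is None else env
    return BasetenJobEnv(
        job_id=env.get("BT_TRAINING_JOB_ID"),
        # Values arrive as strings (e.g. "4"), so convert the GPU count.
        num_gpus=int(env.get("BT_NUM_GPUS", "0")),
        rw_cache_dir=env.get("BT_RW_CACHE_DIR"),
        checkpoint_dir=env.get("BT_CHECKPOINT_DIR"),
    )
```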
Multinode Environment Variables
The following environment variables are particularly useful for multinode training jobs:
Environment Variable | Description | Example |
---|---|---|
BT_GROUP_SIZE | Number of nodes in the multinode deployment | "2" |
BT_LEADER_ADDR | Address of the leader node | "10.0.0.1" |
BT_NODE_RANK | Rank of the node | "0" |
Any available port (for example, 29500) will work. There is no specific port number required by Baseten.
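To show how the multinode variables fit together, the sketch below assembles a launch command for PyTorch's `torchrun` from them. Using `torchrun` is an assumption, since the text does not prescribe a launcher. The helper `build_torchrun_args` and the script name `train.py` are hypothetical. The port defaults to 29500 only as an example.

```python
import os
from typing import List, Mapping, Optional


def build_torchrun_args(script: str, port: int = 29500,
                        env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build a torchrun command from the Baseten multinode variables."""
    env = os.environ if env is None else env
    return [
        "torchrun",
        f"--nnodes={env['BT_GROUP_SIZE']}",        # number of nodes
        f"--nproc-per-node={env['BT_NUM_GPUS']}",  # one process per GPU
        f"--node-rank={env['BT_NODE_RANK']}",      # rank of this node
        f"--master-addr={env['BT_LEADER_ADDR']}",  # leader node address
        f"--master-port={port}",                   # any free port
        script,
    ]


if __name__ == "__main__":
    # Prints the command; pass the list to subprocess.run to launch it.
    print(" ".join(build_torchrun_args("train.py")))
```

Every node would run the same command. Only `BT_NODE_RANK` differs between nodes, and the port must be the same on all of them.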
Deploy Checkpoints as a Model
Deploy checkpoints CLI wizard
The easiest way to deploy your checkpoints is by using the CLI wizard:
deploy_checkpoints today and must be manually configured in the truss config.
Once you’ve completed the wizard, Baseten will generate a truss and deploy a published model according to the specs provided.