Training
Reference documentation for Baseten’s Training SDK classes and configuration.
The Training SDK provides classes for configuring and managing machine learning model training jobs on Baseten. This reference documents the key classes used to define training configurations.
Deploy a TrainingJob
To deploy a training job, use the following command:
The following classes are used to configure and deploy training jobs:
TrainingJob
Defines a complete training job configuration.
Example usage:
TrainingProject
Organizes training jobs and provides project-level configuration.
Example usage:
Image
Specifies the container image for the training environment.
Example usage:
Compute
Specifies compute resources for training jobs.
Example usage:
Runtime
Defines the runtime environment for training jobs.
Example usage:
Training Cache
When enable_cache=True is set in your Runtime, the training cache is enabled.
The cache will be mounted at two locations:
/root/.cache/huggingface
$BT_RW_CACHE_DIR: Baseten exports this variable in your job’s environment.
Cache storage is separate from the ephemeral storage limits of your training job. Training Projects provide storage segregation within the cache: training jobs in the same project share the same cache, while training jobs in different projects cannot access each other’s data.
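Because jobs in the same project share the cache, a common pattern is to build an expensive artifact once and reuse it in later jobs. The following is a minimal sketch of that pattern, not part of the Baseten SDK: the helper name `cached_artifact`, the fallback directory, and the `.partial` suffix are all assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def cached_artifact(
    name: str,
    build: Callable[[Path], None],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Return the path of `name` inside the training cache, building it only once."""
    # BT_RW_CACHE_DIR is exported by Baseten when the cache is enabled.
    # The fallback path is an assumption so the sketch also runs locally.
    cache_root = Path(env.get("BT_RW_CACHE_DIR", "/tmp/bt_cache_fallback"))
    target = cache_root / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first, then rename, so an interrupted
        # job does not leave a half-written file under the final name.
        partial = target.with_name(target.name + ".partial")
        build(partial)
        partial.replace(target)
    return target
```

A job would call it as `cached_artifact("data/tokens.txt", write_tokens)`, where `write_tokens` is a hypothetical function that writes the file to the path it is given. A later job in the same project would find the file already present and skip the build.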
SecretReference
Used to securely reference secrets stored in your Baseten workspace.
Example usage:
CheckpointingConfig
Configures model checkpointing behavior during training. Baseten exports the $BT_CHECKPOINT_DIR environment variable within the Training Job’s environment. Checkpoint storage is independent of the pod’s ephemeral storage.
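From the training script's side, using the checkpoint mount amounts to writing files under the exported directory. The sketch below shows one way this might look. The `checkpoint-<step>/state.json` layout, the helper name `save_checkpoint`, and the local fallback directory are assumptions for illustration, not a format required by Baseten.

```python
import json
import os
from pathlib import Path
from typing import Mapping


def save_checkpoint(
    step: int,
    state: dict,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Write `state` as JSON under the checkpoint directory and return the file path."""
    # BT_CHECKPOINT_DIR is exported by Baseten inside the training job.
    # The "./checkpoints" fallback is an assumption for local runs.
    root = Path(env.get("BT_CHECKPOINT_DIR", "./checkpoints"))
    # Hypothetical layout: one sub-directory per training step.
    ckpt_dir = root / f"checkpoint-{step}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / "state.json"
    path.write_text(json.dumps(state))
    return path
```

A real training loop would typically hand the same directory to its framework's own save routine rather than writing JSON by hand.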
Example usage:
Baseten Provided Environment Variables
Baseten automatically provides several environment variables in your training job’s environment to help integrate your code with the Baseten platform.
Environment Variables
Environment Variable | Description | Example |
---|---|---|
BT_TRAINING_JOB_ID | ID of the Training Job | "gvpql31" |
BT_NUM_GPUS | Number of available GPUs per node | "4" |
BT_RW_CACHE_DIR | Non-HuggingFace cache directory of the training cache mount | "/root/.cache/user_artifacts" |
BT_CHECKPOINT_DIR | Directory of the automated checkpointing mount | "/tmp/checkpoints" |
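The example values in the table are quoted because environment variables always arrive as strings, so numeric values such as BT_NUM_GPUS need converting before use. Here is a minimal sketch of reading them in a training script. The `JobInfo` class and `read_job_info` function are hypothetical names, not part of the Baseten SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JobInfo:
    job_id: str    # value of BT_TRAINING_JOB_ID
    num_gpus: int  # value of BT_NUM_GPUS, converted from its string form


def read_job_info(env: Mapping[str, str] = os.environ) -> JobInfo:
    """Read the job ID and per-node GPU count from the environment."""
    # A KeyError here means the script is running outside a Baseten training job.
    return JobInfo(
        job_id=env["BT_TRAINING_JOB_ID"],
        num_gpus=int(env["BT_NUM_GPUS"]),
    )
```

The job ID is handy for tagging logs or experiment-tracker runs, and the GPU count for sizing per-device batches.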
Multinode Environment Variables
The following environment variables are particularly useful for multinode training jobs:
Environment Variable | Description | Example |
---|---|---|
BT_GROUP_SIZE | Number of nodes in the multinode deployment | "2" |
BT_LEADER_ADDR | Address of the leader node | "10.0.0.1" |
BT_NODE_RANK | Rank of the node | "0" |
For multinode deployments, any traditionally used port number (e.g. 29500) will work. There is no specific port number required by Baseten.
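As an illustration of how these variables fit together, the sketch below assembles a launch command for PyTorch's `torchrun`. Using PyTorch is an assumption here, since this page does not prescribe a launcher, and the helper names `torchrun_command` and `world_size` are hypothetical. The port defaults to 29500, following the note above that any conventional port works.

```python
import os
from typing import List, Mapping


def torchrun_command(
    script: str,
    env: Mapping[str, str] = os.environ,
    port: int = 29500,
) -> List[str]:
    """Build a torchrun argument list from the Baseten multinode variables."""
    return [
        "torchrun",
        f"--nnodes={env['BT_GROUP_SIZE']}",        # number of nodes
        f"--nproc_per_node={env['BT_NUM_GPUS']}",  # one process per GPU on this node
        f"--node_rank={env['BT_NODE_RANK']}",      # rank of this node
        f"--master_addr={env['BT_LEADER_ADDR']}",  # leader node address
        f"--master_port={port}",                   # any conventional port
        script,
    ]


def world_size(env: Mapping[str, str] = os.environ) -> int:
    """Total number of processes across all nodes (nodes x GPUs per node)."""
    return int(env["BT_GROUP_SIZE"]) * int(env["BT_NUM_GPUS"])
```

The resulting list can be passed to `subprocess.run`, or joined with spaces and used as a start command. With the example values from the tables (2 nodes, 4 GPUs per node), `world_size` evaluates to 8.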
Deploy Checkpoints as a Model
These classes are used with the following command:
DeployCheckpointsRuntime
Configures the runtime environment for deployed checkpoints.
Checkpoint
Represents metadata for a saved model checkpoint.
CheckpointList
Manages a collection of checkpoints and their download configuration.
DeployCheckpointsConfig
Specifies configuration for deploying trained model checkpoints.
Example usage: