Reference documentation for Baseten’s Training SDK classes and configuration.
## TrainingJob
If `enable_cache=True` is set in your Runtime, the training cache will be enabled. The cache will be mounted at two locations:

- `/root/.cache/huggingface`
- `$BT_RW_CACHE_DIR` - Baseten will export this variable in your job's environment.

The automated checkpointing mount is available at `$BT_CHECKPOINT_DIR` within the Training Job's environment. The checkpointing storage is independent of the pod's ephemeral storage.
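As a sketch of how these mounts might be used from training code (the fallback values below are assumptions so the snippet runs outside a Baseten job; inside a job, Baseten exports both variables for you):

```python
import os

# Inside a Training Job, Baseten exports these variables; the fallbacks
# here only let the snippet run outside that environment.
cache_dir = os.environ.get("BT_RW_CACHE_DIR", "/root/.cache/user_artifacts")
ckpt_dir = os.environ.get("BT_CHECKPOINT_DIR", "/mnt/ckpts")

# Persist non-HuggingFace artifacts (e.g. a tokenized dataset) to the
# cache mount, and write checkpoints to the checkpointing mount so they
# survive the pod's ephemeral storage. The subdirectory names are
# illustrative, not required by Baseten.
dataset_cache = os.path.join(cache_dir, "tokenized_dataset")
checkpoint_path = os.path.join(ckpt_dir, "step_1000")
```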
Baseten exports the following environment variables in every Training Job:

| Environment Variable | Description | Example |
|---|---|---|
| `BT_TRAINING_JOB_ID` | ID of the Training Job | `"gvpql31"` |
| `BT_NUM_GPUS` | Number of available GPUs per node | `"4"` |
| `BT_RW_CACHE_DIR` | Non-HuggingFace cache directory of the training cache mount | `"/root/.cache/user_artifacts"` |
| `BT_CHECKPOINT_DIR` | Directory of the automated checkpointing mount | `"/mnt/ckpts"` |
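A short sketch of reading these variables in a training script (the fallback values mirror the table's examples and are assumptions for running outside a job; the `run-` naming pattern is illustrative, not a Baseten convention):

```python
import os

# Fallbacks mirror the example values in the table above; inside a
# Training Job, Baseten exports the real values.
job_id = os.environ.get("BT_TRAINING_JOB_ID", "gvpql31")
num_gpus = int(os.environ.get("BT_NUM_GPUS", "4"))

# A common pattern: namespace run artifacts by job ID and enumerate one
# device index per available GPU on this node.
run_name = f"run-{job_id}"
device_ids = list(range(num_gpus))
```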
For multinode Training Jobs, the following additional environment variables are available:

| Environment Variable | Description | Example |
|---|---|---|
| `BT_GROUP_SIZE` | Number of nodes in the multinode deployment | `"2"` |
| `BT_LEADER_ADDR` | Address of the leader node | `"10.0.0.1"` |
| `BT_NODE_RANK` | Rank of the node | `"0"` |
Any port (e.g. `29500`) will work; there is no specific port number required by Baseten.
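A minimal sketch of how these variables might feed a PyTorch-style process-group setup. The fallback values mirror the table's examples and are assumptions for running outside a job; the `torch.distributed` call is shown only as a comment, since the training framework is your choice:

```python
import os

# Baseten exports these in multinode jobs; fallbacks are the table's
# example values so the snippet runs standalone.
leader_addr = os.environ.get("BT_LEADER_ADDR", "10.0.0.1")
node_rank = int(os.environ.get("BT_NODE_RANK", "0"))
group_size = int(os.environ.get("BT_GROUP_SIZE", "2"))
num_gpus = int(os.environ.get("BT_NUM_GPUS", "4"))

# Any free port works for the rendezvous; 29500 is the common PyTorch
# default, not a Baseten requirement.
leader_port = 29500
init_method = f"tcp://{leader_addr}:{leader_port}"

# One process per GPU across all nodes:
world_size = group_size * num_gpus
first_rank_on_node = node_rank * num_gpus
# e.g. torch.distributed.init_process_group(
#     backend="nccl", init_method=init_method,
#     world_size=world_size, rank=first_rank_on_node + local_rank)
```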