- Resume failed training jobs.
- Incremental training and fine-tuning.
Accessing downloaded checkpoints
Checkpoints are available through the BT_LOAD_CHECKPOINT_DIR environment variable. For single-node training, they’re located in BT_LOAD_CHECKPOINT_DIR/rank-0/. For multi-node training, each node’s checkpoints are in BT_LOAD_CHECKPOINT_DIR/rank-<node_rank>/.
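As a minimal sketch of the layout described above, the helper below builds the per-node checkpoint directory from the environment variable. The function name loaded_checkpoint_dir and its node_rank argument are illustrative and not part of any Baseten API; how a job learns its own node rank is left to the caller.

```python
import os
from pathlib import Path


def loaded_checkpoint_dir(node_rank: int = 0) -> Path:
    """Return BT_LOAD_CHECKPOINT_DIR/rank-<node_rank> as a Path.

    Hypothetical helper: only the environment variable name and the
    rank-<node_rank> layout come from the documentation above.
    """
    base = os.environ.get("BT_LOAD_CHECKPOINT_DIR")
    if not base:
        raise RuntimeError(
            "BT_LOAD_CHECKPOINT_DIR is not set; was checkpoint loading enabled?"
        )
    if node_rank < 0:
        raise ValueError("node_rank must be non-negative")
    # Single-node jobs use the default rank 0; multi-node jobs pass their own rank.
    return Path(base) / f"rank-{node_rank}"
```

Single-node jobs can call the helper with no arguments; multi-node jobs would pass the rank of the node the code is running on.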
Checkpoint reference
Create references to checkpoints using the BasetenCheckpoint factory:
From latest
- project_name: Load the latest checkpoint from the most recent job in this project.
- job_id: Load the latest checkpoint from this specific job.
- Both parameters: Load the latest checkpoint from that specific job in that project.
From named
- checkpoint_name: The name of the specific checkpoint to load.
- job_id: The job that contains the named checkpoint.
- Both parameters: Load the named checkpoint from that specific job in that project.
Configuration examples
Here are practical examples of how to configure checkpoint loading in your training jobs:
From latest
From named
- enabled: Set to True to enable checkpoint loading.
- checkpoints: List containing checkpoint references.
- download_folder: Optional custom download location (defaults to /tmp/loaded_checkpoints).
Complete TrainingJob setup
Using checkpoints in your training code
Access loaded checkpoints using the BT_LOAD_CHECKPOINT_DIR environment variable:
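One way to use the directory in a training script is sketched below: it lists the checkpoint folders downloaded for this node and picks the one with the highest trailing step number. The checkpoint-<step> folder naming is an assumption borrowed from common trainer conventions, not something this page specifies, and find_resume_checkpoint is a hypothetical helper name.

```python
import os
import re
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(node_rank: int = 0) -> Optional[Path]:
    """Return the downloaded checkpoint folder with the highest step, or None.

    Assumes (not confirmed by the docs) that each checkpoint is a
    subdirectory whose name ends in a step number, e.g. "checkpoint-500".
    """
    base = os.environ.get("BT_LOAD_CHECKPOINT_DIR")
    if not base:
        return None  # checkpoint loading was not enabled for this job
    rank_dir = Path(base) / f"rank-{node_rank}"
    if not rank_dir.is_dir():
        return None

    candidates = []
    for entry in rank_dir.iterdir():
        match = re.search(r"(\d+)$", entry.name)
        if entry.is_dir() and match:
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    # Compare by step number only and return the matching path.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    checkpoint = find_resume_checkpoint()
    if checkpoint is None:
        print("No loaded checkpoint found; starting from scratch.")
    else:
        print(f"Resuming from {checkpoint}")
```

The returned path could then be handed to whatever resume option the training framework provides. Returning None rather than raising lets the same script run both with and without checkpoint loading enabled.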