Benefits of checkpointing
- Avoid catastrophic out-of-disk errors: We mount additional storage at the checkpointing directory, so checkpoints are less likely to fill the disk during your training run.
- Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
- Seamless checkpoint management: Checkpoints are automatically uploaded to cloud storage for easy access and management.
Enabling checkpointing
To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True.
Using the checkpoint directory
Baseten automatically exports the $BT_CHECKPOINT_DIR environment variable in your job's environment.
Write your checkpoints to the $BT_CHECKPOINT_DIR directory so Baseten can automatically back up and preserve them.
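As a minimal sketch of writing to the checkpoint directory from a training script, the snippet below reads BT_CHECKPOINT_DIR from the environment and saves a small state file there. The helper names (checkpoint_dir, save_checkpoint), the JSON format, the file naming scheme, and the local ./checkpoints fallback are all assumptions made for illustration. In a real job you would typically call your framework's own save routine with a path under this directory.

```python
import json
import os
from pathlib import Path


def checkpoint_dir(default: str = "./checkpoints") -> Path:
    """Return the directory checkpoints should be written to."""
    # BT_CHECKPOINT_DIR is the variable named in the text above.
    # The local fallback is an assumption so the script also runs outside a job.
    path = Path(os.environ.get("BT_CHECKPOINT_DIR", default))
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(step: int, state: dict) -> Path:
    """Write one checkpoint file for the given step and return its path."""
    # Hypothetical naming scheme: zero-padded step number keeps files sorted.
    target = checkpoint_dir() / f"checkpoint-{step:06d}.json"
    target.write_text(json.dumps(state))
    return target
```

For example, calling save_checkpoint(100, {"loss": 0.42}) inside the training loop would write checkpoint-000100.json under the checkpoint directory.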