Benefits of Checkpointing
- Avoid catastrophic out-of-disk errors: We mount additional storage at the checkpointing directory to help avoid out-of-disk errors during your training run.
- Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
- Seamless checkpoint management: Checkpoints are automatically uploaded to cloud storage for easy access and management.
Enabling Checkpointing
To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True.
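A minimal sketch of this configuration follows. The class names CheckpointingConfig and Runtime and the enabled field come from the text above; the import path `truss_train.definitions` and the `checkpointing_config` keyword argument are assumptions here, so check them against the SDK version you have installed.

```python
# Assumption: the training SDK exposes its config classes under
# truss_train.definitions. Adjust the import if your version differs.
from truss_train import definitions

# Assumption: Runtime accepts the checkpointing settings through a
# keyword argument named checkpointing_config.
training_runtime = definitions.Runtime(
    checkpointing_config=definitions.CheckpointingConfig(
        enabled=True,  # turn on checkpointing for this job
    ),
)
```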
Using the Checkpoint Directory
Baseten will automatically export the $BT_CHECKPOINT_DIR environment variable in your job’s environment.
Write your checkpoints to the $BT_CHECKPOINT_DIR directory so Baseten can automatically back up and preserve them.
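As an illustration, a training script might resolve the directory from the environment and write each checkpoint beneath it. This is a sketch using only the Python standard library. The helper names `checkpoint_dir` and `save_checkpoint`, the `checkpoint-<step>` folder layout, the JSON state file, and the local `./checkpoints` fallback are all hypothetical choices made for the example, not Baseten requirements.

```python
import json
import os
from pathlib import Path


def checkpoint_dir() -> Path:
    """Return the directory where checkpoints should be written."""
    # BT_CHECKPOINT_DIR is exported inside the training job. The local
    # fallback is an assumption so the script also runs outside a job.
    return Path(os.environ.get("BT_CHECKPOINT_DIR", "./checkpoints"))


def save_checkpoint(step: int, state: dict) -> Path:
    """Write `state` as JSON under <checkpoint dir>/checkpoint-<step>/."""
    # Hypothetical layout: one subdirectory per training step.
    target = checkpoint_dir() / f"checkpoint-{step}"
    target.mkdir(parents=True, exist_ok=True)
    path = target / "state.json"
    path.write_text(json.dumps(state))
    return path


if __name__ == "__main__":
    saved = save_checkpoint(100, {"step": 100, "loss": 0.42})
    print(f"wrote checkpoint to {saved}")
```

With a real framework, you would replace the JSON write with that framework's own save call and pass it a path under the same directory.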