With checkpointing enabled, you can manage your model checkpoints seamlessly and avoid common training issues.

Benefits of Checkpointing

  • Avoid catastrophic out-of-disk errors: We mount additional storage at the checkpointing directory so that checkpoints are less likely to exhaust disk space during your training run.
  • Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
  • Seamless checkpoint management: Checkpoints are automatically uploaded to cloud storage for easy access and management.

Enabling Checkpointing

To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True:
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    checkpointing_config=definitions.CheckpointingConfig(enabled=True)
)

Using the Checkpoint Directory

Baseten will automatically export the $BT_CHECKPOINT_DIR environment variable in your job’s environment.
Write your checkpoints to the $BT_CHECKPOINT_DIR directory so that Baseten can automatically back up and preserve them.
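
As a minimal sketch of what writing to that directory might look like, the snippet below reads $BT_CHECKPOINT_DIR and saves a small JSON state file per training step. The helper names (checkpoint_dir, save_checkpoint), the checkpoint-<step>/state.json layout, and the ./checkpoints fallback for local runs are assumptions made for illustration; they are not part of the Baseten API, and a real job would typically save model weights with its training framework's own save function instead of JSON.

```python
import json
import os
from pathlib import Path


def checkpoint_dir() -> Path:
    """Return the directory checkpoints should be written to."""
    # BT_CHECKPOINT_DIR is exported by Baseten when checkpointing is enabled.
    # The "./checkpoints" fallback is an assumption for running outside a job.
    path = Path(os.environ.get("BT_CHECKPOINT_DIR", "./checkpoints"))
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(state: dict, step: int) -> Path:
    """Write `state` as JSON under <checkpoint dir>/checkpoint-<step>/."""
    # Hypothetical layout: one subdirectory per saved step.
    target_dir = checkpoint_dir() / f"checkpoint-{step}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "state.json"
    target.write_text(json.dumps(state))
    return target


if __name__ == "__main__":
    # Example: record the step and loss after 100 training steps.
    saved = save_checkpoint({"step": 100, "loss": 0.42}, step=100)
    print(f"Wrote checkpoint to {saved}")
```

If your training framework accepts an output directory for its checkpoints, pointing that option at the value of $BT_CHECKPOINT_DIR should have the same effect without a custom helper.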

Serving Checkpoints

Once your training is complete, you can serve your model checkpoints using Baseten’s serving infrastructure. Learn more about serving checkpoints.