With checkpointing enabled, Baseten manages your model checkpoints for you and helps you avoid common training failures, such as running out of disk space mid-run.

Benefits of checkpointing

  • Avoid catastrophic out-of-disk errors: Baseten mounts additional storage at the checkpointing directory to help prevent out-of-disk errors during your training run.
  • Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
  • Seamless checkpoint management: Checkpoints are automatically uploaded to cloud storage for easy access and management.

Enabling checkpointing

To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True:
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    checkpointing_config=definitions.CheckpointingConfig(enabled=True)
)

Using the checkpoint directory

Baseten automatically exports the $BT_CHECKPOINT_DIR environment variable in your job’s environment.
Write your checkpoints to this directory so Baseten can automatically back them up and preserve them.
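As a minimal sketch of how training code might target this directory: read $BT_CHECKPOINT_DIR from the environment and build per-step checkpoint paths under it. The `checkpoint-<step>` naming and the local fallback path are illustrative assumptions, not part of Baseten's API:

```python
import os

# BT_CHECKPOINT_DIR is set by Baseten inside the job environment;
# the fallback path here is only for running this sketch locally.
checkpoint_dir = os.environ.get("BT_CHECKPOINT_DIR", "/tmp/checkpoints")

def checkpoint_path(step):
    """Build a per-step subdirectory path under the checkpoint directory.

    The checkpoint-<step> layout is an assumed convention (it matches what
    tools like Hugging Face Trainer produce), not something Baseten requires.
    """
    return os.path.join(checkpoint_dir, f"checkpoint-{step}")

# For example, a Hugging Face-style trainer could simply point its
# output directory at checkpoint_dir:
#   TrainingArguments(output_dir=checkpoint_dir, save_steps=500, ...)
```

Because a separate process uploads anything written here, your training loop only pays the cost of the local write.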

Browsing checkpoints

Use the CLI to list and interactively explore checkpoint files for a job:
truss train checkpoints list --job-id abc123
In interactive mode, you can fuzzy-search checkpoints, navigate their directory tree, and inspect file contents, including tensor summaries for .safetensors files. See checkpoints list for all options.

Resume training from a checkpoint

To resume training from a saved checkpoint or initialize a new job from a previous run, configure a LoadCheckpointConfig on the Runtime. Baseten downloads the referenced checkpoints into $BT_LOAD_CHECKPOINT_DIR before your start_commands run:
from truss_train import definitions

load_checkpoint_config = definitions.LoadCheckpointConfig(
    enabled=True,
    checkpoints=[
        definitions.BasetenCheckpoint.from_latest_checkpoint(
            project_name="my-training-project",
        ),
    ],
)
For more information, see loading checkpoints.
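Once the checkpoints are downloaded, your start command decides how to consume them. As a sketch, assuming checkpoints were saved as `checkpoint-<step>` subdirectories (a common Hugging Face Trainer convention, not something Baseten mandates), you might resolve the newest one from $BT_LOAD_CHECKPOINT_DIR like this:

```python
import os

# Baseten downloads the referenced checkpoints into BT_LOAD_CHECKPOINT_DIR
# before start_commands run; the fallback path is only for local illustration.
load_dir = os.environ.get("BT_LOAD_CHECKPOINT_DIR", "/tmp/loaded-checkpoints")

def latest_checkpoint(base):
    """Return the checkpoint-<step> subdirectory with the highest step,
    or None if no such subdirectory exists.

    Assumes the checkpoint-<step> naming convention described above.
    """
    if not os.path.isdir(base):
        return None
    candidates = [d for d in os.listdir(base) if d.startswith("checkpoint-")]
    if not candidates:
        return None
    newest = max(candidates, key=lambda d: int(d.split("-")[-1]))
    return os.path.join(base, newest)

# For example, a Hugging Face-style trainer could then resume with:
#   trainer.train(resume_from_checkpoint=latest_checkpoint(load_dir))
```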

Serving checkpoints

Serve your model checkpoints using Baseten’s serving infrastructure. Reference training checkpoints in your weights configuration using the bt:// URI scheme. See serving checkpoints for deployment details.
When you delete a job or project, all undeployed checkpoints are permanently deleted with no archival or recovery option. Deployed checkpoints aren’t affected. See Management for details.