Baseten Training is designed to provide a structured yet flexible way to manage your machine learning training workflows. To use it effectively, it helps to understand the main ideas behind its components and how they fit together. This isn’t an API reference, but rather a guide to thinking about how to organize and execute your training tasks.

Organizing Your Work with TrainingProjects

A TrainingProject is a lightweight organizational tool that groups related TrainingJobs together. While there are a few technical details to consider, your team can use TrainingProjects to facilitate collaboration and keep experiments organized.

Running a TrainingJob

Once you have a TrainingProject, the actual work of training a model happens within a TrainingJob. Each TrainingJob represents a single, complete execution of your training script with a specific configuration.
  • What it is: A TrainingJob is the fundamental unit of execution. It bundles together (see the configuration sketch after this list):
    • Your training code.
    • A base image.
    • The compute resources needed to run the job.
    • The runtime configurations like startup commands and environment variables.
  • Why use it: Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new TrainingJobs while knowing that previous ones have been persisted on Baseten.
  • Lifecycle: A job goes through various stages, from being created (TRAINING_JOB_CREATED), to resources being set up (TRAINING_JOB_DEPLOYING), to actively running your script (TRAINING_JOB_RUNNING), and finally to a terminal state like TRAINING_JOB_COMPLETED. More details on the job lifecycle can be found on the Lifecycle page.
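As a rough sketch of how these pieces fit together, the configuration below assembles an image, compute resources, and a runtime into a job inside a project. The exact class and field names (Image, base_image, start_commands, and the TrainingJob and TrainingProject constructors) are assumptions here; consult the Baseten Training SDK reference for the authoritative signatures.
from truss_train import definitions

# A minimal sketch -- field names are assumptions; see the SDK reference for
# exact signatures.
training_job = definitions.TrainingJob(
    image=definitions.Image(base_image="my-registry/my-training-image:latest"),
    compute=definitions.Compute(node_count=1),  # plus GPU type/count options
    runtime=definitions.Runtime(
        start_commands=["python train.py"],
    ),
)

training_project = definitions.TrainingProject(name="my-experiments", job=training_job)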

Iterate Faster with the Training Cache

The training cache enables you to persist data between training jobs. This can significantly improve iteration speed by skipping expensive downloads and data transformations.
  • How to use it: Set the cache configuration in your Runtime:
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    cache_config=definitions.CacheConfig(enabled=True)
)
  • Cache Directory: The cache will be mounted at /root/.cache/user_artifacts, which can be accessed via the $BT_RW_CACHE_DIR environment variable.
  • Legacy HF Cache: We recommend using the new cache directory at /root/.cache/user_artifacts. However, if you need data mounted at /root/.cache/huggingface for compatibility reasons, you can set enable_legacy_hf_cache=True in your CacheConfig; this legacy option is not recommended for new projects.
  • Seeding Your Data: For multi-GPU training, ensure your data is seeded into the cache before launching multi-process training jobs. A common approach is to split your workflow into a separate data-loading (seeding) script and a training script, as sketched after this list.
  • Speedup: For a 400 GB Hugging Face dataset, you can expect to save nearly an hour of compute time on each job, since the data download and preparation have already been done.
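As a minimal sketch of the seeding pattern (the dataset name and script layout here are illustrative, and this assumes the Hugging Face datasets library), a data-loading step can download into the cache directory so that subsequent training processes read from the already-populated cache:
import os

from datasets import load_dataset  # assumes the Hugging Face datasets library is installed

# seed_data.py -- run as a single process before launching multi-process training.
# The dataset name below is illustrative.
cache_dir = os.path.join(os.environ["BT_RW_CACHE_DIR"], "datasets")
load_dataset("imdb", cache_dir=cache_dir)  # downloads and prepares the data into the cache

# train.py can then call load_dataset with the same cache_dir and skip the download.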
You can inspect the contents of the cache through the CLI with truss train cache summarize <project_name or project_id>. This visibility into what’s in the cache can help you verify your code is working as expected and manage files and artifacts you no longer need.

Manage Checkpoints Seamlessly with Baseten Checkpointing

With checkpointing enabled, you can:
  • Avoid catastrophic out-of-disk errors: We mount additional storage at the checkpointing directory to help avoid out-of-disk errors during your training run.
  • Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory is uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
  • Easily deploy for inference: Deploy checkpoints using our deploy_checkpoints CLI wizard.
To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True:
from truss_train import definitions

training_runtime = definitions.Runtime(
    # ... other configuration options
    checkpointing_config=definitions.CheckpointingConfig(enabled=True)
)
Baseten will automatically export the $BT_CHECKPOINT_DIR environment variable in your job’s environment. Ensure your code writes checkpoints to $BT_CHECKPOINT_DIR, as in the sketch below.
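For example, with a Hugging Face transformers-style trainer (shown purely as an illustration; any framework works as long as it writes to the checkpoint directory), you might point the output directory at $BT_CHECKPOINT_DIR:
import os

from transformers import TrainingArguments  # assumes the Hugging Face transformers library

# Write checkpoints into the Baseten-provided checkpoint directory so the
# background checkpointing process can upload them to the cloud.
checkpoint_dir = os.environ["BT_CHECKPOINT_DIR"]

training_args = TrainingArguments(
    output_dir=checkpoint_dir,
    save_strategy="steps",
    save_steps=500,
    # ... other training arguments
)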

Multinode Training

Baseten Training supports multinode training via InfiniBand. To deploy a multinode training job:
  • Configure the Compute resource in your TrainingJob by setting the node_count to the number of nodes you’d like to use (e.g. 2).
from truss_train import definitions

compute = definitions.Compute(
    node_count=2,  # Use 2 nodes for multinode training
    # ... other compute configuration options
)
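Your start command is then responsible for launching a process on each node. One common pattern is torchrun; the sketch below assumes Runtime accepts start_commands, and the $NODE_RANK and $LEADER_ADDR variables are placeholders for whatever rank and rendezvous values your setup provides, not documented Baseten variables:
from truss_train import definitions

# A sketch of a multinode launch via torchrun. $NODE_RANK and $LEADER_ADDR are
# placeholders, not documented Baseten environment variables.
training_runtime = definitions.Runtime(
    start_commands=[
        "torchrun --nnodes=2 --nproc_per_node=8 "
        "--node_rank=$NODE_RANK --master_addr=$LEADER_ADDR --master_port=29500 "
        "train.py"
    ],
    # ... other configuration options
)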

Securely Integrate with External Services with SecretReference

Successfully training a model often requires many tools and services. Baseten provides SecretReference for secure handling of secrets.
  • How to use it: Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace with a specific name. In your job’s configuration (e.g., environment variables), you refer to this secret by its name using SecretReference. The actual secret value is never exposed in your code.
  • How it works: Baseten injects the secret value at runtime under the environment variable name that you specify.
from truss_train import definitions

runtime = definitions.Runtime(
    # ... other runtime options
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
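Inside your training script, the secret is then available as an ordinary environment variable. A short usage sketch (the huggingface_hub call assumes that library is installed and is only an example):
import os

from huggingface_hub import login  # assumes the huggingface_hub library

# The secret value is injected at runtime under the environment variable name
# you chose above (HF_TOKEN in this example).
login(token=os.environ["HF_TOKEN"])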

Running Inference on Trained Models

The journey from training to a usable model in Baseten typically follows this path:
  1. A TrainingJob with checkpointing enabled produces one or more model artifacts.
  2. You run truss train deploy_checkpoints to deploy a model from your most recent training job. You can read more about this at Deploying Trained Models.
  3. Once deployed, your model will be available for inference via API. See more at Calling Your Model.