Organizing Your Work with TrainingProjects
A TrainingProject is a lightweight organization tool to help you group different TrainingJobs together.
While there are a few technical details to consider, your team can use TrainingProjects to facilitate collaboration and organization.
Running a TrainingJob
Once you have a TrainingProject, the actual work of training a model happens within a TrainingJob. Each TrainingJob represents a single, complete execution of your training script with a specific configuration.
- What it is: A TrainingJob is the fundamental unit of execution. It bundles together:
  - Your training code.
  - A base image.
  - The compute resources needed to run the job.
  - The runtime configurations like startup commands and environment variables.
- Why use it: Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new TrainingJobs while knowing that previous ones have been persisted on Baseten.
- Lifecycle: A job goes through various stages, from being created (TRAINING_JOB_CREATED), to resources being set up (TRAINING_JOB_DEPLOYING), to actively running your script (TRAINING_JOB_RUNNING), and finally to a terminal state like TRAINING_JOB_COMPLETED. More details on the job lifecycle can be found on the Lifecycle page.
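The lifecycle above can be sketched as a small polling helper. This is an illustration only, not a Baseten API: the four stage names come from this page, while wait_for_terminal, its parameters, and the get_status callback are hypothetical. Only TRAINING_JOB_COMPLETED is treated as terminal by default; pass any other terminal states listed on the Lifecycle page through the terminal argument.

```python
import time
from typing import Callable, Iterable

# Stage names taken from the lifecycle description above, in order.
STAGES = (
    "TRAINING_JOB_CREATED",
    "TRAINING_JOB_DEPLOYING",
    "TRAINING_JOB_RUNNING",
    "TRAINING_JOB_COMPLETED",
)


def wait_for_terminal(
    get_status: Callable[[], str],
    terminal: Iterable[str] = ("TRAINING_JOB_COMPLETED",),
    poll_seconds: float = 0.0,
    max_polls: int = 100,
) -> str:
    """Poll get_status until it returns a terminal state or polls run out.

    get_status is a hypothetical callback; how you fetch the status of a
    real job is outside the scope of this sketch.
    """
    terminal_set = set(terminal)
    for _ in range(max_polls):
        status = get_status()
        if status in terminal_set:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"no terminal state after {max_polls} polls")


if __name__ == "__main__":
    # Fake status source that walks through the stages once.
    fake = iter(STAGES)
    print(wait_for_terminal(lambda: next(fake)))
```

The max_polls bound is a deliberate choice: a job that ends in a terminal state you did not list will raise a timeout rather than loop forever.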
Iterate Faster with the Training Cache
The training cache enables you to persist data between training jobs. This can significantly improve iteration speed by skipping expensive downloads and data transformations.
- How to use it: Set the cache configuration in your Runtime.
- Cache Directory: The cache will be mounted at /root/.cache/user_artifacts, which can be accessed via the $BT_RW_CACHE_DIR environment variable.
- Legacy HF Cache: We recommend using the new cache directory at /root/.cache/user_artifacts instead. However, if you need to access data mounted to /root/.cache/huggingface for compatibility reasons, you can set enable_legacy_hf_cache=True in your CacheConfig. Note that this legacy option is not recommended for new projects.
- Seeding Your Data: For multi-GPU training, you should ensure that your data is seeded before running multi-process training jobs. You can do this by separating your training script into a training script and a data loading script.
- Speedup: For a 400 GB HF Dataset, you can expect to save nearly an hour of compute time for each job, because data download and preparation have already been done.
To see what is in the cache, run truss train cache summarize <project_name or project_id>. This visibility into what’s in the cache can help you verify that your code is working as expected, and manage files and artifacts you no longer need.
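As a minimal sketch of the seeding pattern described above, the following Python helper prepares a dataset once and reuses it from the cache on later jobs. Only the BT_RW_CACHE_DIR variable comes from this page; the prepare_once helper, the my_dataset name, the .complete marker file, and the local fallback directory are hypothetical choices for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def cache_root() -> Path:
    # BT_RW_CACHE_DIR is the cache location described on this page.
    # The temp-dir fallback is an assumption so the sketch also runs locally.
    default = Path(tempfile.gettempdir()) / "bt_cache_fallback"
    return Path(os.environ.get("BT_RW_CACHE_DIR", default))


def prepare_once(name: str, build: Callable[[Path], None]) -> Path:
    """Run build(target_dir) only if no finished copy is in the cache."""
    target = cache_root() / name
    marker = target / ".complete"  # hypothetical "finished" marker
    if marker.exists():
        return target  # cache hit: skip the expensive step
    target.mkdir(parents=True, exist_ok=True)
    build(target)
    marker.touch()  # written last, so partial builds are not reused
    return target


def build_dataset(target: Path) -> None:
    # Placeholder for an expensive download or preprocessing step.
    (target / "train.txt").write_text("example row\n")


if __name__ == "__main__":
    print(prepare_once("my_dataset", build_dataset))
```

Running a data loading script like this in a single process before the multi-process training script starts means every training process finds the data already in place.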
Manage checkpoints seamlessly with Baseten checkpointing.
With checkpointing enabled, you can:
- Avoid catastrophic out-of-disk errors: We mount additional storage at the checkpointing directory to help avoid out-of-disk errors during your training run.
- Maximize GPU utilization: When checkpointing is enabled, any data written to the checkpointing directory will be uploaded to the cloud by a separate process, allowing you to maximize GPU time spent training.
- Easily deploy for inference: Deploy checkpoints using our deploy_checkpoints CLI wizard.
To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True. The $BT_CHECKPOINT_DIR environment variable is then set in your job’s environment. Ensure your code writes checkpoints to $BT_CHECKPOINT_DIR.
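A minimal sketch of writing checkpoints to that directory from a training script follows. Only the BT_CHECKPOINT_DIR variable comes from this page; the save_checkpoint helper, the checkpoint-<step> directory layout, the JSON payload standing in for model weights, and the local fallback directory are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def checkpoint_dir() -> Path:
    # BT_CHECKPOINT_DIR is the checkpointing directory described on this page.
    # Falling back to a temp directory is an assumption for local runs.
    default = Path(tempfile.gettempdir()) / "bt_checkpoints_fallback"
    path = Path(os.environ.get("BT_CHECKPOINT_DIR", default))
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(step: int, state: dict) -> Path:
    """Write one checkpoint into its own subdirectory."""
    target = checkpoint_dir() / f"checkpoint-{step}"  # hypothetical layout
    target.mkdir(parents=True, exist_ok=True)
    # JSON stands in for real model weights in this sketch.
    (target / "state.json").write_text(json.dumps(state))
    return target


if __name__ == "__main__":
    for step in (100, 200):
        print(save_checkpoint(step, {"step": step, "loss": 1.0 / step}))
```

If you use a training framework, pointing its output directory at the value of BT_CHECKPOINT_DIR achieves the same thing without a custom helper.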
Multinode Training
Baseten Training supports multinode training via InfiniBand. To deploy a multinode training job:
- Configure the Compute resource in your TrainingJob by setting the node_count to the number of nodes you’d like to use (e.g. 2).
- Make sure you’ve properly integrated with the Baseten-provided environment variables.
Securely Integrate with External Services with SecretReference
Successfully training a model often requires many tools and services. Baseten provides SecretReference for secure handling of secrets.
- How to use it: Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace with a specific name. In your job’s configuration (e.g., environment variables), you refer to this secret by its name using SecretReference. The actual secret value is never exposed in your code.
- How it works: Baseten injects the secret value at runtime under the environment variable name that you specify.
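On the training-script side, the injected value is read like any other environment variable. The sketch below assumes the job configuration maps a secret to a variable named WANDB_API_KEY; that name, and the require_secret and redact helpers, are hypothetical and not part of Baseten's API.

```python
import os


def require_secret(env_name: str) -> str:
    """Return the value injected under env_name, or fail with a clear error."""
    value = os.environ.get(env_name)
    if not value:
        raise RuntimeError(
            f"Environment variable {env_name} is not set; check that the job "
            "configuration maps a secret to this name."
        )
    return value


def redact(value: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    # WANDB_API_KEY is an assumed example name; the placeholder default
    # exists only so this sketch runs outside a training job.
    os.environ.setdefault("WANDB_API_KEY", "example-not-a-real-key")
    key = require_secret("WANDB_API_KEY")
    print(redact(key))
```

Failing fast when the variable is missing surfaces a misnamed secret at job start rather than partway through training.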
Running Inference on Trained Models
The journey from training to a usable model in Baseten typically follows this path:
- A TrainingJob with checkpointing enabled produces one or more model artifacts.
- You run truss train deploy_checkpoints to deploy a model from your most recent training job. You can read more about this at Deploying Trained Models.
- Once deployed, your model will be available for inference via API. See more at Calling Your Model.