Organizing Your Work with `TrainingProject`s
A `TrainingProject` is a lightweight organizational tool that helps you group related `TrainingJob`s together. While there are a few technical details to consider, your team can use `TrainingProject`s to facilitate collaboration and organization.
Running a `TrainingJob`
Once you have a `TrainingProject`, the actual work of training a model happens within a `TrainingJob`. Each `TrainingJob` represents a single, complete execution of your training script with a specific configuration.
- What it is: A `TrainingJob` is the fundamental unit of execution. It bundles together:
  - Your training code.
  - A base `image`.
  - The `compute` resources needed to run the job.
  - The `runtime` configurations like startup commands and environment variables.
- Why use it: Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new `TrainingJob`s while knowing that previous ones have been persisted on Baseten.
- Lifecycle: A job goes through various stages, from being created (`TRAINING_JOB_CREATED`), to resources being set up (`TRAINING_JOB_DEPLOYING`), to actively running your script (`TRAINING_JOB_RUNNING`), and finally to a terminal state like `TRAINING_JOB_COMPLETED`. More details on the job lifecycle can be found on the Lifecycle page.
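The lifecycle above can be sketched as a simple state sequence. This is an illustrative sketch only: the status names in the happy path come from the docs, but the additional terminal states and the helper function are hypothetical, not part of Baseten's SDK.

```python
# Happy-path progression of a training job, per the lifecycle described above.
HAPPY_PATH = [
    "TRAINING_JOB_CREATED",
    "TRAINING_JOB_DEPLOYING",
    "TRAINING_JOB_RUNNING",
    "TRAINING_JOB_COMPLETED",
]

# Assumed terminal states -- "completed" is documented above; the failure and
# stopped states are plausible additions, not confirmed names.
TERMINAL_STATES = {
    "TRAINING_JOB_COMPLETED",
    "TRAINING_JOB_FAILED",
    "TRAINING_JOB_STOPPED",
}

def is_terminal(status: str) -> bool:
    """A job in a terminal state will not transition further."""
    return status in TERMINAL_STATES
```

A polling loop that watches a job would stop as soon as `is_terminal` returns `True`.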
Compute Resources
The `Compute` configuration defines the computational resources your training job will use. This includes:
- GPU specifications - Choose from various GPU types based on your model’s requirements
- CPU and memory - Configure the amount of CPU and RAM allocated to your job
- Node count - For single-node or multi-node training setups
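Putting those three knobs together, a `Compute` configuration covers roughly the following fields. This is a hypothetical dataclass for illustration, not Baseten's actual API; the real field names in the SDK may differ, so check the SDK reference before use.

```python
from dataclasses import dataclass

@dataclass
class ComputeSpec:
    """Hypothetical sketch of the resources a training job requests."""
    accelerator: str      # GPU type, e.g. "H100"
    gpu_count: int = 1    # GPUs per node
    cpu_count: int = 4    # vCPUs allocated to the job
    memory_gb: int = 16   # RAM allocated to the job
    node_count: int = 1   # >1 enables multi-node training

# Example: a 2-node job with 4 GPUs per node.
spec = ComputeSpec(accelerator="H100", gpu_count=4, node_count=2)
```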
Base Images
Baseten provides pre-configured base images that include common ML frameworks and dependencies. These images are optimized for training workloads and include:
- Popular ML frameworks (PyTorch, VERL, Megatron-LM, etc.)
- GPU drivers and CUDA support
- Common data science libraries
- Baseten’s training SDK
Securely Integrate with External Services Using `SecretReference`
Successfully training a model often requires external tools and services. Baseten provides `SecretReference` for secure handling of secrets.
- How to use it: Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace under a specific name. In your job's configuration (e.g., environment variables), refer to the secret by that name using a `SecretReference`. The actual secret value is never exposed in your code.
- How it works: Baseten injects the secret value at runtime under the environment variable name that you specify.
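Inside your training script, the injected secret is then just an ordinary environment variable. A minimal sketch, assuming you chose the variable name `WANDB_API_KEY` in your job configuration (the injection is simulated here for illustration; on Baseten you would not set the value yourself):

```python
import os

# Simulate what Baseten does at runtime: the secret value appears under the
# environment variable name you specified in the job configuration.
os.environ["WANDB_API_KEY"] = "dummy-value-for-illustration"

# Your training script reads it like any other environment variable.
api_key = os.environ.get("WANDB_API_KEY")
assert api_key, "WANDB_API_KEY was not injected; check your job config"
```

Because the script only ever sees the variable name, the secret value never appears in your code or version control.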
Running Inference on Trained Models
The journey from training to a usable model on Baseten typically follows this path:
- A `TrainingJob` with checkpointing enabled produces one or more model artifacts.
- You run `truss train deploy_checkpoint` to deploy a model from your most recent training job. You can read more about this at Serving Trained Models.
- Once deployed, your model will be available for inference via API. See more at Calling Your Model.
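The deploy-then-call flow above can be sketched from the command line. The `truss train deploy_checkpoint` command is the one named in this doc; the inference URL shape and `Api-Key` header below are assumptions for illustration, so substitute the endpoint and model ID shown in your Baseten dashboard.

```shell
# Deploy a model from the checkpoints of your most recent training job.
truss train deploy_checkpoint

# Call the deployed model (hypothetical endpoint shape -- copy the real URL
# and model ID from your Baseten dashboard).
curl -X POST "https://model-<MODEL_ID>.api.baseten.co/environments/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world"}'
```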