This page covers the essential building blocks of Baseten Training. These are the core concepts you’ll need to understand to effectively organize and execute your training workflows.

Organizing Your Work with TrainingProjects

A TrainingProject is a lightweight organizational tool that helps you group related TrainingJobs together. While there are a few technical details to consider, your team can use TrainingProjects to facilitate collaboration and organization.

Running a TrainingJob

Once you have a TrainingProject, the actual work of training a model happens within a TrainingJob. Each TrainingJob represents a single, complete execution of your training script with a specific configuration.
  • What it is: A TrainingJob is the fundamental unit of execution. It bundles together:
    • Your training code.
    • A base image.
    • The compute resources needed to run the job.
    • The runtime configurations like startup commands and environment variables.
  • Why use it: Each job is a self-contained, reproducible experiment. If you want to try training your model with a different learning rate, more GPUs, or a slightly modified script, you can create new TrainingJobs while knowing that previous ones have been persisted on Baseten.
  • Lifecycle: A job goes through various stages, from being created (TRAINING_JOB_CREATED), to resources being set up (TRAINING_JOB_DEPLOYING), to actively running your script (TRAINING_JOB_RUNNING), and finally to a terminal state like TRAINING_JOB_COMPLETED. More details on the job lifecycle can be found on the Lifecycle page.
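Putting those pieces together, a job definition might look like the following sketch. Apart from `Runtime` (which appears in the SecretReference example later on this page), the class and field names here (`TrainingJob`, `Image`, `Compute`, `base_image`, `start_commands`) are illustrative assumptions about the `truss_train` SDK, not a verified API; consult the SDK reference for exact signatures.

```python
from truss_train import definitions

# Illustrative sketch only: apart from Runtime, the class and field
# names below are assumptions about the truss_train SDK.
job = definitions.TrainingJob(
    # Base image the job runs in (image tag is illustrative).
    image=definitions.Image(base_image="pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime"),
    # Compute resources for the job.
    compute=definitions.Compute(node_count=1),
    # Runtime configuration: startup commands and environment variables.
    runtime=definitions.Runtime(
        start_commands=["sh -c 'pip install -r requirements.txt && python train.py'"],
        environment_variables={"LEARNING_RATE": "3e-4"},
    ),
)
```

To try a different learning rate or GPU count, you would create a new TrainingJob with a modified configuration rather than mutating an old one, so each experiment stays reproducible.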

Compute Resources

The Compute configuration defines the computational resources your training job will use. This includes:
  • GPU specifications - Choose from various GPU types based on your model’s requirements
  • CPU and memory - Configure the amount of CPU and RAM allocated to your job
  • Node count - For single-node or multi-node training setups
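As a sketch, a single-node configuration might be expressed like this; the `Compute` field names are assumptions about the `truss_train` SDK, not a verified API:

```python
from truss_train import definitions

# Sketch: field names below are assumptions, not a verified API.
compute = definitions.Compute(
    node_count=1,        # 1 for single-node; >1 for multi-node training
    cpu_count=8,         # hypothetical CPU allocation field
    memory="64Gi",       # hypothetical RAM allocation field
    # accelerator=...,   # GPU type and count, chosen per your model's needs
)
```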

Base Images

Baseten provides pre-configured base images that include common ML frameworks and dependencies. These images are optimized for training workloads and include:
  • Popular ML frameworks (PyTorch, VERL, Megatron-LM, etc.)
  • GPU drivers and CUDA support
  • Common data science libraries
  • Baseten’s training SDK
You can also use custom or private images if you have specific requirements.
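For example, pointing a job at a custom image might look like the following sketch; `Image` and `base_image` are assumed names from the `truss_train` SDK, and the registry path is a hypothetical private image:

```python
from truss_train import definitions

# Sketch: `Image` and `base_image` are assumed names, not a verified API.
# The registry path below is a hypothetical private image.
image = definitions.Image(base_image="registry.example.com/my-team/trainer:latest")
```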

Securely Integrate with External Services with SecretReference

Successfully training a model often depends on external tools and services, such as experiment trackers or model hubs. Baseten provides SecretReference for secure handling of the credentials these services require.
  • How to use it: Store your secret (e.g., an API key for Weights & Biases) in your Baseten workspace with a specific name. In your job’s configuration (e.g., environment variables), you refer to this secret by its name using SecretReference. The actual secret value is never exposed in your code.
  • How it works: Baseten injects the secret value at runtime under the environment variable name that you specify.
from truss_train import definitions

runtime = definitions.Runtime(
    # ... other runtime options
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
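Inside the training script itself, the injected secret is then read like any other environment variable. A minimal helper (plain Python, no Baseten-specific API) might look like:

```python
import os

def get_required_env(name: str) -> str:
    """Read an environment variable injected at runtime, e.g. via a SecretReference."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"{name} is not set; check the job's environment_variables")
    return value

# e.g. hf_token = get_required_env("HF_TOKEN")
```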

Running Inference on Trained Models

The journey from training to a usable model in Baseten typically follows this path:
  1. A TrainingJob with checkpointing enabled produces one or more model artifacts (checkpoints).
  2. You run truss train deploy_checkpoint to deploy a model from your most recent training job. You can read more about this at Serving Trained Models.
  3. Once deployed, your model will be available for inference via API. See more at Calling Your Model.
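In CLI terms, the path above might look like the following sketch. The config filename is illustrative, and `truss train push` is an assumption about the CLI beyond the `truss train deploy_checkpoint` subcommand mentioned above:

```shell
# Submit a training job defined in a config file (filename is illustrative).
truss train push my_training_config.py

# Deploy a model from the most recent training job's checkpoints.
truss train deploy_checkpoint
```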