This guide will walk you through the initial setup and the process of submitting your first TrainingJob using Baseten Training.

Prerequisites

Before you begin, ensure you have the following:

  1. Baseten Account: You’ll need an active Baseten account. If you don’t have one, please sign up on the Baseten web app.
  2. API Key: Obtain an API key for your Baseten account. This key is required to authenticate with the Baseten API and SDK.
  3. Truss SDK and CLI: The truss package provides a Python-native way to define and run your training jobs, and its CLI provides a convenient way to deploy and manage them. Install or update it:
    pip install -U truss 
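
To confirm the install from Python, the following is a minimal sketch that uses only the standard library. The helper name `installed_version` is illustrative and is not part of the truss package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("truss")
    if version:
        print(f"truss {version} is installed")
    else:
        print("truss is not installed; run: pip install -U truss")
```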
    

Step 1: Define your Training Configuration

The primary way to define your training jobs is through a Python configuration file, typically named config.py. This file uses the truss package to specify all aspects of your TrainingProject and TrainingJob.

A simple example of a config.py file is shown below:

config.py
# Import necessary classes from the Baseten Training SDK
from truss_train import definitions
from truss.base import truss_config

# 1. Define a base image for your training job
BASE_IMAGE = "axolotlai/axolotl:main-20250324-py3.11-cu124-2.6.0"

# 2. Define the Runtime Environment for the Training Job
# This includes start commands and environment variables.
# Secrets stored in your Baseten workspace (such as API keys) are referenced
# using `SecretReference`.
training_runtime = definitions.Runtime(
    start_commands=[ # Example: list of commands to run your training script
        # "pip install -r requirements.txt", # pip install requirements on top of base image
        "/bin/sh -c './run.sh'",  
    ],
    environment_variables={
        # Secrets (ensure these are configured in your Baseten workspace)
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
        "WANDB_API_KEY" : definitions.SecretReference(name="wandb_api_key"),
        "HELLO": "WORLD"
    },
)

# 3. Define the Compute Resources for the Training Job
training_compute = definitions.Compute(
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,  
        count=4,  
    ),
)

# 4. Define the Training Job
# This brings together the image, compute, and runtime configurations.
my_training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime
)


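# 5. Define the Training Project
# This config will be pushed using the Truss CLI.
# The association of the job to the project happens at the time of push.
project_name = "my-first-training-project"  # Placeholder: choose your own project name
first_project_with_job = definitions.TrainingProject(
    name=project_name,
    job=my_training_job
)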

Key considerations for your Baseten Training configuration file

  • Local Artifacts: If your training requires local scripts (like a train.py or a run.sh), helper files, or configuration files (e.g., accelerate config), place them in the same directory as your config.py or in subdirectories. When you push the training job, truss will package these artifacts and upload them. They will be copied into the container at the root of the base image’s working directory.
  • Secrets: Ensure any secrets referenced via SecretReference (e.g., hf_access_token, wandb_api_key) are defined in your Baseten workspace settings.
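
Before pushing, it can help to preview which files sit next to config.py. The sketch below simply walks that directory. The function name `list_local_artifacts` is hypothetical, and the packaging rules truss actually applies (for example, any files it skips) may differ from this plain listing.

```python
from pathlib import Path
from typing import List


def list_local_artifacts(config_path: str) -> List[str]:
    """List every file under the directory that contains config.py.

    Paths are returned relative to that directory, sorted alphabetically.
    This is only a preview of the directory contents, not truss's own logic.
    """
    root = Path(config_path).resolve().parent
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    for artifact in list_local_artifacts("config.py"):
        print(artifact)
```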

For a complete guide to the TrainingJob type, see our SDK reference.

What can I run in the start_commands?

In short, anything! Baseten Training is a framework-agnostic training platform. Any training framework and training methodology is supported. Typically, a run.sh script is used. An example might look like this:

run.sh
#!/bin/bash

# Exit immediately if a command exits with a non-zero status
set -e

# Install dependencies
pip install -r requirements.txt

# Authenticate with Weights & Biases
wandb login "$WANDB_API_KEY"  # set via environment_variables in the Runtime in config.py

# Download models and datasets
huggingface-cli download google/gemma-3-27b-it
huggingface-cli download Abirate/english_quotes --repo-type dataset

# Run training
accelerate launch --config_file config.yml --num_processes $BT_NUM_GPUS train.py
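
Because run.sh relies on variables such as WANDB_API_KEY and BT_NUM_GPUS, a training script may want to fail fast when one is missing. The following is a minimal sketch of such a check, assuming it is placed near the top of a script like train.py. The helper `missing_env_vars` is illustrative and not part of any Baseten SDK.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Variable names taken from the config.py and run.sh examples in this guide.
    missing = missing_env_vars(["HF_TOKEN", "WANDB_API_KEY", "BT_NUM_GPUS"])
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All expected environment variables are set.")
```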

Additional features

We’ve kept the above config simple to help you get started, but there’s a lot more you can do with Baseten Training:

  • Checkpointing - automatically save and deploy your model checkpoints.
  • Training Cache - speed up training by caching data and models between jobs.
  • Multinode - train on multiple GPU nodes to make the most out of your compute.

Step 2: Submit Your Training Job

Once your config.py and any local artifacts are ready, you submit the training job using the truss CLI:

truss train push config.py

This command does the following:

  1. Parses your config.py.
  2. Packages any local files in the directory (and subdirectories) alongside config.py.
  3. Creates or updates the TrainingProject specified in your config.
  4. Submits the defined TrainingJob under that project.
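
If you prefer to trigger the push from a Python script (for example, in CI), one option is to shell out to the same CLI command. This is a sketch under assumptions: the function names are hypothetical, and capturing output this way would hide any interactive prompts the CLI might show.

```python
import subprocess
from typing import List


def build_push_command(config_path: str = "config.py") -> List[str]:
    """Build the argument list for `truss train push <config_path>`."""
    return ["truss", "train", "push", config_path]


def push_training_job(config_path: str = "config.py") -> str:
    """Run the push and return the CLI's standard output.

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    result = subprocess.run(
        build_push_command(config_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```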

Upon successful submission, the CLI will output helpful information about your job:

✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`

Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
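
If you capture the CLI output in a script, the Job ID can be pulled out with a regular expression. The sketch below assumes the output keeps the `--job-id <id>` form shown above and that IDs consist of letters, digits, or underscores; both are assumptions, and the output format may change.

```python
import re
from typing import Optional

# Matches the ID that follows "--job-id" in the CLI's hint lines.
_JOB_ID_PATTERN = re.compile(r"--job-id\s+(\w+)")


def extract_job_id(cli_output: str) -> Optional[str]:
    """Return the first job ID found in `cli_output`, or None if there is none."""
    match = _JOB_ID_PATTERN.search(cli_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "Training job successfully created!\n"
        "View logs for your job via `truss train logs --job-id e3m512w [--tail]`\n"
        "View metrics for your job via `truss train metrics --job-id e3m512w`\n"
    )
    print(extract_job_id(sample))
```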

Next Steps

  • Core Concepts: Deepen your understanding of Baseten Training and explore key features like CheckpointingConfig, Training Cache, and Multinode.
  • Management: Learn how to check status, view logs and metrics, and stop jobs.