This guide will walk you through the initial setup and the process of submitting your first TrainingJob using Baseten Training. In this demo, we’ll create a finetuned revision of OpenAI’s gpt-oss-20b!

Prerequisites

Before you begin, ensure you have the following:
  1. Baseten Account: You’ll need an active Baseten account. If you don’t have one, please sign up on the Baseten web app.
  2. API Key: Obtain an API key for your Baseten account. This key is required to authenticate with the Baseten API and SDK.
  3. Truss SDK and CLI: The truss package provides a Python-native way to define and run your training jobs, and the CLI provides a convenient way to deploy and manage them. Install or update it:
    pip install -U truss 
    
  4. Dependencies: In this demo, we’ll use Hugging Face to access and upload models. We recommend creating a Hugging Face access token and adding it to your Baseten Secrets. Additionally, it can be helpful to visualize your training run; in this example, we optionally use Weights & Biases (wandb).

Step 1: Define your training configuration

The primary way to define your training jobs is through a Python configuration file, typically named config.py. This file uses the truss package to specify all aspects of your TrainingProject and TrainingJob. A simple example of a config.py file is shown below:
config.py
# Import necessary classes from the Baseten Training SDK
from truss_train import definitions
from truss.base import truss_config

# 1. Define a base image for your training job. You can also use
# private images via AWS IAM or GCP Service Account authentication.
BASE_IMAGE = "pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"

# 2. Define the Runtime Environment for the Training Job
# This includes start commands and environment variables.
# Secrets from the baseten workspace like API keys are referenced using 
# `SecretReference`.
training_runtime = definitions.Runtime(
    start_commands = [
        "chmod +x ./run.sh && ./run.sh",
    ],
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"), # The name of the HF Access Token secret in your B10 account
        "HF_HOME": "/root/.cache/user_artifacts/hf_cache"
        # Uncomment to export your wandb api key.
        # "WANDB_API_KEY" : definitions.SecretReference(name="wandb_api_key"),
    },
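    # Enable the training cache so models and datasets downloaded in one job
    # can be reused by later jobs.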
    cache_config=definitions.CacheConfig(
        enabled=True,
    )
)

# 3. Define the Compute Resources for the Training Job
training_compute = definitions.Compute(
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,  
        count=1,  
    ),
)

# 4. Define the Training Job
# This brings together the image, compute, and runtime configurations.
training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime
)


# This config will be pushed using the Truss CLI.
# The association of the job to the project happens at the time of push.
training_project = definitions.TrainingProject(
    name="LoRA Training Job - gpt-oss-20b",
    job=training_job
)

Key considerations for your Baseten training configuration file

  • Local Artifacts: If your training requires local scripts (like a train.py or a run.sh), helper files, or configuration files (e.g., accelerate config), place them in the same directory as your config.py or in subdirectories. When you push the training job, truss will package these artifacts and upload them. They will be copied into the container at the root of the base image’s working directory.
  • Ignore Folders and Files: You can exclude specific files from being pushed by creating a .truss_ignore file in the root directory of your project. In this file, you can add entries in a style similar to .gitignore (see the example after this list). Refer to the CLI reference for more details.
  • Secrets: Ensure any secrets referenced via SecretReference (e.g., hf_access_token, wandb_api_key) are defined in your Baseten workspace settings.
  • Private Images: You can deploy your jobs with private images by specifying a DockerAuth in your Image configuration. See our DockerAuth SDK for more details.
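For reference, here’s a hypothetical .truss_ignore; the entries below are placeholders, assuming gitignore-style patterns as described above, so adapt them to your own project layout.
.truss_ignore
# gitignore-style patterns; matching files are not uploaded with the job
.venv/
__pycache__/
data/
*.log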
For a complete guide on the TrainingJob type, check out our SDK reference.

What can I run in the start_commands?

In short, anything! Baseten Training is a framework-agnostic training platform. Any training framework and training methodology is supported. Typically, a run.sh script is used. An example might look like this:
run.sh
#!/bin/bash

# Exit on error, treat unset variables as errors, and print each command as it runs
set -eux

# Install dependencies
pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0" 
# Uncomment to enable wandb
# pip install wandb

# Let's run! 
python3 train.py
To complete the example, we provide the train.py below.
train.py
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Mxfp4Config
import torch

from trl import SFTConfig, SFTTrainer

## TODO: update your dataset here
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")


tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")


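# gpt-oss-20b ships with MXFP4-quantized MoE weights; dequantize=True loads them
# in a higher-precision dtype (bfloat16 here) so they can be fine-tuned with LoRA.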
quantization_config = Mxfp4Config(dequantize=True)
model_kwargs = dict(
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,
    device_map="auto",
)

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)

from peft import LoraConfig, get_peft_model

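# Apply LoRA to all linear layers; target_parameters (available in peft >= 0.17)
# additionally adapts the MoE expert projections in transformer layers 7, 15, and 23.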
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    target_parameters=[
        "7.mlp.experts.gate_up_proj",
        "7.mlp.experts.down_proj",
        "15.mlp.experts.gate_up_proj",
        "15.mlp.experts.down_proj",
        "23.mlp.experts.gate_up_proj",
        "23.mlp.experts.down_proj",
    ],
)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
    num_train_epochs=0.3,
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=2048,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    output_dir="gpt-oss-20b-multilingual-reasoner",
    # Target repo for trainer.push_to_hub(); push_to_hub=False keeps training
    # from uploading checkpoints automatically.
    hub_model_id="baseten-admin/gpt-oss-20b-multilingual-reasoner",
    push_to_hub=False,
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

trainer.save_model(training_args.output_dir)
# Push the trained adapter in output_dir to the Hugging Face repo set via
# hub_model_id above (push_to_hub's first positional argument is the commit
# message, not the repo id).
trainer.push_to_hub()
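Once the job finishes, a quick sanity check is to load the trained adapter on top of the base model and generate from it. The sketch below uses the standard transformers and peft loading APIs; it assumes the adapter was pushed to the example repo above (you can also point it at a local output_dir) and that you’re on a machine with enough GPU memory for the 20B base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
from peft import PeftModel

# Load the base model the same way as in training, then attach the LoRA adapter.
# "baseten-admin/gpt-oss-20b-multilingual-reasoner" is the example repo id from
# the script above; swap in your own repo id or a local checkpoint directory.
base_model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    quantization_config=Mxfp4Config(dequantize=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "baseten-admin/gpt-oss-20b-multilingual-reasoner")
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(base_model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))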

Training Different Models

This recipe and more can be found at Baseten’s ML Cookbook. Clone the repo to get the starter code for this demo, along with other training and finetuning examples!

Additional features

We’ve kept the above config simple to help you get off the ground, but there’s a lot more you can do with Baseten Training:
  • Checkpointing - automatically save and deploy your model checkpoints (see the sketch after this list).
  • Training Cache - speed up training by caching data and models between jobs.
  • Multinode - train on multiple GPU nodes to make the most of your compute.
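To give a feel for how these plug into the config from Step 1, here’s a hedged sketch of enabling checkpointing on the Runtime. It assumes a checkpointing_config field analogous to cache_config and a definitions.CheckpointingConfig type; confirm the exact field and class names against the SDK reference before relying on them.
from truss_train import definitions

# Sketch only: the field and class names below are assumed by analogy with
# cache_config / CacheConfig; check the SDK reference for the authoritative API.
training_runtime = definitions.Runtime(
    start_commands=["chmod +x ./run.sh && ./run.sh"],
    checkpointing_config=definitions.CheckpointingConfig(
        enabled=True,  # persist checkpoints written by your training script
    ),
    cache_config=definitions.CacheConfig(enabled=True),
)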

Step 2: Submit Your Training Job

Once your config.py and any local artifacts are ready, you submit the training job using the truss CLI:
truss train push config.py
This command does the following:
  1. Parses your config.py.
  2. Packages any local files in the directory (and subdirectories) alongside config.py.
  3. Creates or updates the TrainingProject specified in your config.
  4. Submits the defined TrainingJob under that project.
Upon successful submission, the CLI will print out a training job id with some helpful commands. You can also navigate to the Baseten Web UI to view your logs and metrics: https://app.baseten.co/training/

Next steps

  • Core Concepts: Deepen your understanding of Baseten Training and explore key features like CheckpointingConfig, Training Cache, and Multinode.
  • Management: Learn how to check status, view logs and metrics, and stop jobs.