Learn to fine-tune a model on Baseten, monitor your training job, and deploy the result as an endpoint.

Prerequisites

Before you begin, ensure you have:
  • Baseten account: Sign up at app.baseten.co.
  • Truss: Install the Truss CLI:
pip install --upgrade truss
You can add keys like hf_access_token or wandb_api_key in Baseten Secrets to access gated models on Hugging Face or track experiment metrics in Weights & Biases.

Create your training project

To create a new training project, use the truss train init command.
truss train init --examples oss-gpt-20b-axolotl
cd oss-gpt-20b-axolotl
Baseten provides starter templates for common frameworks like Axolotl, Unsloth, and TRL. Browse the ML Cookbook for more examples. Skip to Submit your training job if you’re using the template.

Write your configuration file

Define your training job in a Python configuration file, typically named config.py. This file specifies your TrainingProject and TrainingJob. The configuration uses classes like Image, Compute, Runtime, and SecretReference:
config.py
from truss_train import (
    TrainingProject,
    TrainingJob,
    Image,
    Compute,
    Runtime,
    SecretReference,
    CacheConfig,
    CheckpointingConfig,
)
from truss.base.truss_config import AcceleratorSpec

# Base image with your training dependencies (use any Docker image)
BASE_IMAGE = "pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"

# Runtime configuration
training_runtime = Runtime(
    start_commands=[
        "chmod +x ./run.sh && ./run.sh",
    ],
    environment_variables={
        # "HF_TOKEN": SecretReference(name="hf_access_token"),
        # "WANDB_API_KEY": SecretReference(name="wandb_api_key"),
    },
    cache_config=CacheConfig(enabled=True),
    checkpointing_config=CheckpointingConfig(enabled=True),
)

# Compute resources
training_compute = Compute(
    accelerator=AcceleratorSpec(accelerator="H100", count=2),
)

# Training job definition
training_job = TrainingJob(
    image=Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
)

# Project groups related training jobs together
training_project = TrainingProject(
    name="LoRA Training Job - gpt-oss-20b",
    job=training_job
)
This example uses the pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime base image. You can use other base images to support your framework:
Framework | Base image
PyTorch | pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
Axolotl | axolotlai/axolotl:main-20250811-py3.11-cu126-2.7.1
Unsloth | unsloth/unsloth:2025.10.9-pt2.8.0-cu12.8-updates-fixes
VeRL | verlai/verl:verl0.3.0.post1
Megatron | baseten/megatron:py3.11.11-cuda12.8.1-torch2.8.0-fa2.8.1-megatron0.14.1-msswift3.10.3
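To train with a different framework, swap BASE_IMAGE in config.py for the matching image from the table; the rest of the configuration stays the same. A minimal sketch using the Axolotl image:
# config.py: use the Axolotl base image instead of the PyTorch one
BASE_IMAGE = "axolotlai/axolotl:main-20250811-py3.11-cu126-2.7.1"

training_job = TrainingJob(
    image=Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
)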
For information on using private images, see the Training SDK reference. When configuring your project, keep the following in mind:
  • Local artifacts: Place scripts (train.py, run.sh), config files, and data in the same directory as config.py. Truss packages everything and uploads it to the container’s working directory.
  • Ignore files: Create a .truss_ignore file to exclude files from upload, using .gitignore syntax (see the example after this list). For more information, see the Training reference.
  • Secrets: Store secrets in your Baseten workspace and reference them with SecretReference.
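For instance, a .truss_ignore might look like this (the entries below are illustrative only; adjust them to your project):
.truss_ignore
# Keep local outputs and environments out of the upload
checkpoints/
wandb/
.venv/
*.ipynb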

Create your training scripts

Baseten Training is framework-agnostic. Typically, you use a run.sh script to install dependencies and launch training. For example:
run.sh
#!/bin/bash
set -eux

# Install dependencies
pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"

# Run training
python3 train.py
Here’s a corresponding train.py:
train.py
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Mxfp4Config
import torch
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

MODEL_ID = "openai/gpt-oss-20b"
DATASET_ID = "HuggingFaceH4/Multilingual-Thinking"

# Load dataset and tokenizer
dataset = load_dataset(DATASET_ID, split="train")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load model with quantization
quantization_config = Mxfp4Config(dequantize=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,
    device_map="auto",
)

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    target_parameters=[
        "7.mlp.experts.gate_up_proj",
        "7.mlp.experts.down_proj",
        "15.mlp.experts.gate_up_proj",
        "15.mlp.experts.down_proj",
        "23.mlp.experts.gate_up_proj",
        "23.mlp.experts.down_proj",
    ],
)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

# Training configuration
training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=2048,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    output_dir=os.getenv("BT_CHECKPOINT_DIR", "./checkpoints"),
)

# Train
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Save to checkpoint directory for deployment
trainer.save_model(training_args.output_dir)

Submit your training job

With your config.py and training scripts ready, submit the job:
truss train push config.py
This command parses your config.py file, packages local files in the directory alongside config.py, creates or updates the TrainingProject, and submits the TrainingJob. On successful submission, you’ll see:
✨ Training job successfully created!
🪵 View logs for your job via 'truss train logs --job-id <job_id> --tail'
🔍 View metrics for your job via 'truss train metrics --job-id <job_id>'
🌐 View job in the UI: https://app.baseten.co/training/<project_id>/logs/<job_id>
Copy the job_id from this output to use in the monitoring commands below.

Monitor your training job

Use the job ID from the submission output to monitor your training job:
truss train logs --job-id <job_id> --tail
You can also view logs, metrics, and job status in the Baseten dashboard. See Monitor and manage jobs for detailed monitoring commands including metrics, job status, and stopping jobs.
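For example, you can check metrics for the job using the command shown in the submission output:
truss train metrics --job-id <job_id>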

Deploy your trained model

Once you see the model saved successfully in your logs, you’re ready to deploy. For example, you might see:
[2026-01-01 12:00:00] [INFO] Model successfully saved to /workspace/checkpoints
Job has exited. Syncing checkpoints...
Deploy your fine-tuned model directly to Baseten’s inference platform:
truss train deploy_checkpoints
The interactive wizard guides you through deployment:
Fetching checkpoints for training job <job_id>...
? Use spacebar to select/deselect checkpoints to deploy.
  ○ .
❯ ○ checkpoint-15

? Enter the model name for your deployment: my-fine-tuned-model
? Select the GPU type to use for deployment: A100
? Select the number of A100 GPUs to use for deployment: 2
? Enter the huggingface secret name: hf_access_token

Successfully created model version: deployment-1
Model version ID: <model_version_id>

Test your deployment

After deployment, call your model using the OpenAI-compatible chat format:
truss predict --model <model-id> --data '{"model": "<checkpoint-name>", "messages": [{"role": "user", "content": "Hello!"}]}'
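You can also call the deployed model from any OpenAI-compatible client. Below is a minimal Python sketch using the openai package; the base URL placeholder and the BASETEN_API_KEY environment variable are assumptions, so substitute the endpoint and API key shown for your deployment in the Baseten dashboard:
import os

from openai import OpenAI

# Assumption: replace base_url with your deployment's OpenAI-compatible
# endpoint from the Baseten dashboard, and set BASETEN_API_KEY to a
# Baseten API key before running.
client = OpenAI(
    base_url="<your-deployment-openai-compatible-endpoint>",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="<checkpoint-name>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)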

Training framework examples

Framework | Example | Description
TRL | oss-gpt-20b-lora-trl | LoRA fine-tuning
TRL | qwen3-8b-lora-dpo-trl | DPO training
Axolotl | oss-gpt-20b-axolotl | Axolotl fine-tuning
Axolotl | gemma-27b-axolotl | Multi-node fine-tuning
Unsloth | llama-8b-lora-unsloth | Fast LoRA fine-tuning
VeRL | qwen3-8b-fft-verl | RL with custom rewards
MS-Swift | glm-4-7-msswift | GLM-4 fine-tuning
MS-Swift | qwen3-235b-mswift | Large model training
See the ML Cookbook for all examples and advanced recipes.