Train models and serve them in production, all on one platform. Baseten automatically stores your checkpoints during training and makes them ready for deployment. No downloading weights, no re-uploading, no separate infrastructure. Your fine-tuned model goes from checkpoint to production endpoint in a single command.
```bash
# Train your model
truss train push config.py

# Deploy from the checkpoint
truss train deploy_checkpoints --job-id <job_id>
```

Train and serve on one platform

The train-to-serve workflow is seamless:
  1. Set up your training project: Bring any framework or start with a template.
  2. Configure your training job: Define compute, runtime, and checkpointing settings.
  3. Run on managed infrastructure: Use H100, H200, or A10G GPUs—single-node or multi-node.
  4. Checkpoints sync automatically: Baseten stores checkpoints as training progresses.
  5. Deploy your fine-tuned model: Go from checkpoint to production endpoint in one command.
No infrastructure management. No manual file transfers. Bring any framework—Axolotl, TRL, VeRL, Megatron, or your own training code—and your trained model serves traffic within minutes of training completion.

Supported frameworks

Baseten Training is framework-agnostic. Use whatever framework fits your workflow.
| Framework | Best for | Example |
| --- | --- | --- |
| Axolotl | Configuration-driven fine-tuning with LoRA/QLoRA | oss-gpt-20b-axolotl |
| TRL | SFT, DPO, and GRPO with Hugging Face | oss-gpt-20b-lora-trl |
| Unsloth | Fast single-GPU LoRA training | llama-8b-lora-unsloth |
| VeRL | Reinforcement learning with custom rewards | qwen3-8b-lora-verl |
| MS-Swift | Long-context and multilingual training | qwen3-30b-mswift-multinode |
Browse the ML Cookbook for more examples including multi-node training with FSDP and DeepSpeed.

Key features

Checkpoint management

Checkpoints sync automatically to Baseten storage during training. You can:
  • Deploy any checkpoint as a production endpoint with truss train deploy_checkpoints.
  • Download checkpoints for local evaluation and analysis.
  • Resume from any checkpoint if a job fails or you want to train further.
Learn more about checkpointing.
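Resuming from the newest synced checkpoint typically means scanning for the highest-numbered `checkpoint-N` directory, the layout most trainers (Axolotl, TRL, etc.) write. The helper below is an illustrative sketch of that pattern, not part of the Baseten or Truss API; the `checkpoint-N` naming is an assumption about your framework's output.

```python
import os
import re
import tempfile

def latest_checkpoint(checkpoint_dir: str):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path

# Example: three synced checkpoints; resuming should pick the newest one.
with tempfile.TemporaryDirectory() as root:
    for step in (100, 200, 300):
        os.makedirs(os.path.join(root, f"checkpoint-{step}"))
    print(latest_checkpoint(root).endswith("checkpoint-300"))  # True
```

Numeric comparison (rather than lexicographic sorting of names) matters here: `checkpoint-1000` should beat `checkpoint-200`.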

Persistent caching

Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don’t re-download 70B models every time. Learn more about the training cache.
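The idea behind a persistent cache can be sketched as a fetch-through pattern: key each artifact by its source URI, and download only on a miss. This is an illustrative sketch of the concept, assuming a generic `download` callable and a cache directory; it is not the Baseten cache API, which the platform manages for you.

```python
import hashlib
import os
import tempfile

def cached_fetch(cache_root: str, uri: str, download) -> str:
    """Return a local path for `uri`, downloading only on a cache miss."""
    key = hashlib.sha256(uri.encode()).hexdigest()
    path = os.path.join(cache_root, key)
    if not os.path.exists(path):  # cache miss: fetch once
        with open(path, "wb") as f:
            f.write(download(uri))
    return path  # later jobs hit the cache and skip the download

calls = []
def fake_download(uri):  # stand-in for a real (slow) model download
    calls.append(uri)
    return b"weights"

with tempfile.TemporaryDirectory() as cache:
    cached_fetch(cache, "hf://some-70b-model", fake_download)
    cached_fetch(cache, "hf://some-70b-model", fake_download)  # served from cache
    print(len(calls))  # 1
```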

Multi-node training

Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables—you just set node_count in your configuration. Learn more about multi-node training.
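As a rough sketch, the compute section of a training config might carry a setting like the one below. Only `node_count` is named in the docs above; the surrounding structure and field names here are placeholders for illustration, not the real Truss config schema.

```python
# Illustrative fragment only: `node_count` is the documented setting;
# the dict layout and other keys are hypothetical placeholders.
training_compute = {
    "accelerator": "H100",  # H100, H200, or A10G per the docs above
    "node_count": 4,        # scale out to 4 GPU nodes; Baseten handles
                            # InfiniBand networking, orchestration, and env vars
}
```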

Interactive development with rSSH

Debug training jobs interactively with SSH-like access to your training environment. Rapidly experiment, inspect state, and iterate without losing reproducibility. Learn more about rSSH.
