Train and serve on one platform
The train-to-serve workflow is seamless:
- Set up your training project: Bring any framework or start with a template.
- Configure your training job: Define compute, runtime, and checkpointing settings.
- Run on managed infrastructure: Use H100, H200, or A10G GPUs—single-node or multi-node.
- Checkpoints sync automatically: Baseten stores checkpoints as training progresses.
- Deploy your fine-tuned model: Go from checkpoint to production endpoint in one command.
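
As a rough illustration of the "configure your training job" step, the sketch below models the three groups of settings the list mentions (compute, runtime, checkpointing) with plain dataclasses. `ComputeSketch`, `JobSketch`, and `validate` are hypothetical stand-ins invented for this example, not the Baseten SDK; the only facts taken from this page are the GPU names and the single-node versus multi-node distinction.

```python
from dataclasses import dataclass, field
from typing import List

# GPU types named in the workflow list above.
GPUS_NAMED_ABOVE = {"H100", "H200", "A10G"}


@dataclass
class ComputeSketch:
    """Hypothetical compute settings; not the real Baseten schema."""
    gpu_type: str = "H100"
    gpus_per_node: int = 1
    node_count: int = 1  # 1 means single-node, more than 1 means multi-node


@dataclass
class JobSketch:
    """Hypothetical job definition: compute, runtime, and checkpointing."""
    compute: ComputeSketch
    start_commands: List[str] = field(default_factory=list)
    checkpointing_enabled: bool = True


def validate(job: JobSketch) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if job.compute.gpu_type not in GPUS_NAMED_ABOVE:
        problems.append(f"unsupported GPU type: {job.compute.gpu_type}")
    if job.compute.gpus_per_node < 1 or job.compute.node_count < 1:
        problems.append("gpus_per_node and node_count must be at least 1")
    if not job.start_commands:
        problems.append("no start command given")
    return problems


if __name__ == "__main__":
    job = JobSketch(ComputeSketch("H100", gpus_per_node=8, node_count=2), ["./run.sh"])
    print(validate(job))
```

Validating a definition locally before submitting it is a design choice of this sketch; the real configuration format is described in the Baseten Training documentation.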
Supported frameworks
Baseten Training is framework-agnostic. Use whatever framework fits your workflow.

| Framework | Best for | Example |
|---|---|---|
| Axolotl | Configuration-driven fine-tuning with LoRA/QLoRA | oss-gpt-20b-axolotl |
| TRL | SFT, DPO, and GRPO with Hugging Face | oss-gpt-20b-lora-trl |
| Unsloth | Fast single-GPU LoRA training | llama-8b-lora-unsloth |
| VeRL | Reinforcement learning with custom rewards | qwen3-8b-lora-verl |
| MS-Swift | Long-context and multilingual training | qwen3-30b-mswift-multinode |
Key features
Checkpoint management
Checkpoints sync automatically to Baseten storage during training. You can:
- Deploy any checkpoint as a production endpoint with `truss train deploy_checkpoints`.
- Download checkpoints for local evaluation and analysis.
- Resume from any checkpoint if a job fails or you want to train further.
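
Resuming or evaluating usually starts with picking a checkpoint. The sketch below selects the most recent one from a local directory of downloaded checkpoints. It assumes the `checkpoint-<step>` directory naming used by the Hugging Face `Trainer`; this page does not specify how Baseten lays out checkpoints, so treat the pattern and the `latest_checkpoint` helper as illustrative.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming convention: checkpoint-<global step>, e.g. checkpoint-500.
_STEP_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _STEP_PATTERN.match(child.name)
        # Ignore plain files and directories that do not follow the pattern.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint(Path("./checkpoints")))
```

Comparing the step as an integer matters here: a plain string sort would rank `checkpoint-900` above `checkpoint-1000`.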
Persistent caching
Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don’t re-download 70B models every time. Learn more about the training cache.
Multi-node training
Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables; you only set `node_count` in your configuration.
Learn more about multi-node training.
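
To give a sense of what "environment variables" means for a training script, the sketch below reads the process layout from the variables that PyTorch's `torchrun` launcher sets (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `LOCAL_WORLD_SIZE`). This page does not list the variables Baseten itself provides, so the torchrun names are an assumption, and `DistInfo` and `read_dist_info` are hypothetical helpers.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes across all nodes
    node_index: int  # which node this process runs on
    num_nodes: int   # total number of nodes


def read_dist_info(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Derive the node layout from torchrun-style variables.

    Missing variables fall back to a single-process, single-node layout.
    """
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", "1"))
    return DistInfo(
        rank=rank,
        local_rank=local_rank,
        world_size=world_size,
        node_index=rank // local_world_size,
        num_nodes=world_size // local_world_size,
    )


if __name__ == "__main__":
    print(read_dist_info())
```

With two nodes of eight GPUs each, the process with `RANK=11` resolves to node index 1 and local rank 3.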
Interactive development with rSSH
Debug training jobs interactively with SSH-like access to your training environment. Rapidly experiment, inspect state, and iterate without losing reproducibility. Learn more about rSSH.
Next steps
Get started
Run your first training job and deploy the result.
ML Cookbook
Production-ready examples for various frameworks and models.