Baseten provides a flexible training platform that lets you bring your own training scripts, use the latest training techniques, and fine-tune the newest models. Train models and serve them in production, all on one platform. Baseten automatically stores your checkpoints during training and makes them ready for deployment. You don’t need to download weights, re-upload them, or manage separate infrastructure: your fine-tuned model goes from checkpoint to production endpoint in a single command. The core workflow requires just two commands:
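For example (a minimal sketch: `truss train push` and the `config.py` filename are assumptions about the Truss CLI; check the CLI reference for exact arguments):

```bash
# Submit the training job described in your project config
# (the config filename is illustrative).
truss train push config.py

# After checkpoints have synced, turn one into a production endpoint.
truss train deploy_checkpoints
```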
Train and serve on one platform
The train-to-serve workflow is seamless:
- Set up your training project: Bring any framework or start with a template.
- Configure your training job: Define compute, runtime, and checkpointing settings.
- Run on managed infrastructure: Use H200 or H100 GPUs, single-node or multi-node.
- Checkpoints sync automatically: Baseten stores checkpoints as training progresses.
- Deploy your fine-tuned model: Go from checkpoint to production endpoint in one command.
Supported frameworks
Baseten Training is framework-agnostic. Use whatever framework fits your workflow.

| Framework | Best for | Example |
|---|---|---|
| Axolotl | Configuration-driven fine-tuning with LoRA/QLoRA | oss-gpt-20b-axolotl |
| TRL | SFT, DPO, and GRPO with Hugging Face | oss-gpt-20b-lora-trl |
| TRL | LoRA DPO fine-tuning | qwen3-8b-lora-dpo-trl |
| VeRL | Reinforcement learning with custom rewards | qwen3-8b-lora-verl |
| MS-Swift | Long-context and multilingual training | qwen3-30b-mswift-multinode |
Key features
Checkpoint management
Checkpoints sync automatically to Baseten storage during training. You can:
- Deploy any checkpoint as a production endpoint with `truss train deploy_checkpoints`.
- Download checkpoints for local evaluation and analysis.
- Resume from any checkpoint if a job fails or you want to train further.
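For instance, deploying a checkpoint from a specific job might look like the sketch below; the `--job-id` flag is an assumption, so run `truss train deploy_checkpoints --help` for the real options.

```bash
# Deploy a checkpoint from a given training job as a production
# endpoint (--job-id is assumed, not verified against the CLI).
truss train deploy_checkpoints --job-id <job-id>
```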
BDN weight and data loading
Load model weights and training data through the Baseten Delivery Network (BDN). Mount weights from Hugging Face, S3, GCS, Azure, R2, or any HTTPS URL directly into your training container, with no download code needed. BDN mirrors weights before compute is provisioned, then caches them for faster mounting on subsequent jobs. See storage and data ingestion for setup details.
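To illustrate what this means in practice: inside the container, mounted weights read like ordinary local files. The `/weights` mount path and file listing below are hypothetical.

```bash
# List BDN-mounted weights inside the training container; no download
# step ran in your code. Path and filenames are illustrative only.
ls /weights/qwen3-8b
# config.json  model-00001-of-00004.safetensors  tokenizer.json  ...
```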
Persistent caching
Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don’t re-download 70B models every time. See the training cache guide for configuration options.
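A common pattern, sketched under the assumption that the cache is exposed as a mounted directory (the `/cache` path is illustrative; `HF_HOME` is Hugging Face's standard cache variable):

```bash
# Point Hugging Face's cache at the persistent training cache so model
# and dataset downloads survive across jobs.
export HF_HOME=/cache/huggingface
python train.py
```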
Multi-node training
Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables. You just set `node_count` in your configuration.
Learn more about multi-node training.
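Inside the job, a standard launcher such as `torchrun` can consume that environment; the `$BT_*` variable names below are placeholders, not confirmed Baseten names.

```bash
# Sketch of a multi-node launch. Baseten supplies the rendezvous
# details via environment variables; substitute the real names from
# the multi-node training guide.
torchrun \
  --nnodes "$BT_NUM_NODES" \
  --node_rank "$BT_NODE_RANK" \
  --master_addr "$BT_LEADER_ADDR" \
  --master_port 29500 \
  train.py
```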
Remote access
Connect to running training containers to debug, inspect state, and iterate without resubmitting. Baseten offers two options:
- SSH: Connect from any OpenSSH client for terminal sessions and file transfer with `scp` or `sftp`.
- rSSH (interactive sessions): Connect from VS Code or Cursor Remote Tunnels for a full IDE experience.
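For example, once SSH access is enabled for a job (the host alias below is a placeholder for whatever connection string the platform gives you):

```bash
# Open a terminal session inside the running training container.
ssh <training-job-host>

# Copy a checkpoint out for local inspection (remote path illustrative).
scp <training-job-host>:/path/to/checkpoint-500 ./checkpoint-500
```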
Next steps
Get started
Run your first training job and deploy the result.
ML Cookbook
Production-ready examples for various frameworks and models.