# Training on Baseten

> Train custom models with developer-first training infrastructure on Baseten.

Baseten provides a flexible training platform that lets you bring your own training scripts, use the latest training techniques, and fine-tune the newest models.

Train models and serve them in production, all on one platform. Baseten automatically stores your checkpoints during training and makes them ready for deployment. You don't need to download weights, re-upload them, or manage separate infrastructure. Your fine-tuned model goes from checkpoint to production endpoint in a single command.

The core workflow requires just two commands:

```bash theme={"system"}
# Train your model
truss train push config.py

# Deploy from the checkpoint
truss train deploy_checkpoints --job-id <job_id>
```
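If you drive these two commands from a script (for example, in CI), a thin Python wrapper can assemble them. The following is a minimal sketch, not part of the Truss CLI or SDK: the helper names `push_command`, `deploy_command`, and `run` are hypothetical, and only the command strings themselves come from the example above. It defaults to a dry run so nothing is submitted by accident.

```python
import shlex
import subprocess
from typing import List


def push_command(config_path: str = "config.py") -> List[str]:
    """Build the argv for submitting a training job (hypothetical helper)."""
    return ["truss", "train", "push", config_path]


def deploy_command(job_id: str) -> List[str]:
    """Build the argv for deploying a job's checkpoints (hypothetical helper)."""
    if not job_id:
        raise ValueError("job_id must be a non-empty string")
    return ["truss", "train", "deploy_checkpoints", "--job-id", job_id]


def run(argv: List[str], dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False."""
    rendered = shlex.join(argv)
    print(rendered)
    if not dry_run:
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(argv, check=True)
    return rendered


if __name__ == "__main__":
    run(push_command("config.py"))
    # "<job_id>" is a placeholder: substitute the ID reported for your job.
    run(deploy_command("<job_id>"))
```

Passing the arguments as a list avoids shell quoting issues when a job ID or path contains unusual characters.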

## Choosing between Truss Train and Loops

Baseten supports two training paths. **Truss Train** is the bring-your-own-container path documented in the rest of this page: package any training framework as a Truss, configure hardware, and run it on managed infrastructure.

**[Loops](/loops/overview)** is a Tinker-compatible managed path for fine-tuning and RL: write Tinker-compatible Python (`import tinker`), and Baseten provisions a paired trainer and sampling server, with weights moving live between them. If your team already uses Tinker, or wants SFT and RL without managing a training container, start with Loops.

|                      | [Loops](/loops/overview)                                                 | `truss train`                                    |
| -------------------- | ------------------------------------------------------------------------ | ------------------------------------------------ |
| Training code        | Tinker-compatible Python (`import tinker`)                               | Any container image                              |
| Infrastructure setup | None. Baseten provisions trainer and sampling servers.                   | You define hardware in a Truss config            |
| Checkpoint format    | Paginated presigned URLs, streamed to sampling server live               | Your container's output artifacts                |
| Inference path       | Automatic: checkpoint deploys directly to a Baseten inference deployment | Manual: you run `truss train deploy_checkpoints` |
| Session lifetime     | Open until you call `truss loops deactivate`                             | Job runs to completion and exits                 |
| Documentation        | [Loops](/loops/overview)                                                 | This section                                     |

## Train and serve on one platform

The train-to-serve workflow has five steps:

1. **Set up your training project:** Bring any framework or start with a template.
2. **Configure your training job:** Define compute, runtime, and checkpointing settings.
3. **Run on managed infrastructure:** Use H200 or H100 GPUs, single-node or multi-node.
4. **Checkpoints sync automatically:** Baseten stores checkpoints as training progresses.
5. **Deploy your fine-tuned model:** Go from checkpoint to production endpoint in one command.

Baseten handles infrastructure management and file transfers. Bring any framework (Axolotl, TRL, VeRL, Megatron, or your own training code), and your trained model can serve traffic within minutes of training completing.

## Supported frameworks

Baseten Training is framework-agnostic. Use whatever framework fits your workflow.

| Framework | Best for                                         | Example                                                                                                                |
| --------- | ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| Axolotl   | Configuration-driven fine-tuning with LoRA/QLoRA | [oss-gpt-20b-axolotl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/oss-gpt-20b-axolotl)               |
| TRL       | SFT, DPO, and GRPO with Hugging Face             | [oss-gpt-20b-lora-trl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/oss-gpt-20b-lora-trl)             |
| TRL       | LoRA DPO fine-tuning                             | [qwen3-8b-lora-dpo-trl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-8b-lora-dpo-trl)           |
| VeRL      | Reinforcement learning with custom rewards       | [qwen3-8b-lora-verl](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-8b-lora-verl)                 |
| MS-Swift  | Long-context and multilingual training           | [qwen3-30b-mswift-multinode](https://github.com/basetenlabs/ml-cookbook/tree/main/examples/qwen3-30b-mswift-multinode) |

Browse the [ML Cookbook](https://github.com/basetenlabs/ml-cookbook) for more examples, including multi-node training with FSDP and DeepSpeed.

## Key features

### Checkpoint management

Checkpoints sync automatically to Baseten storage during training. You can:

* **Deploy** any checkpoint as a production endpoint with [`truss train deploy_checkpoints`](/training/deployment).
* **Download** checkpoints for local evaluation and analysis.
* **Resume** from any checkpoint if a job fails or you want to train further.

Learn more about [checkpointing](/training/concepts/checkpoints).
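Resuming usually means locating the most recent checkpoint before training starts. The sketch below assumes your framework writes Hugging Face Trainer-style `checkpoint-<step>` directories; the function name `latest_checkpoint` and the directory layout are assumptions, not Baseten APIs, so adapt the pattern to whatever your training code saves.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming convention: checkpoint-<global step>, e.g. checkpoint-500.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps numerically: as strings, "checkpoint-20" would
        # sort after "checkpoint-100".
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

A training script can call this at startup and hand the result to its framework's resume option, falling back to a fresh run when it returns `None`.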

### BDN weight and data loading

Load model weights and training data through [Baseten Delivery Network (BDN)](/training/concepts/storage#load-weights-and-data-with-bdn). Mount weights from Hugging Face, S3, GCS, Azure, R2, or any HTTPS URL directly into your training container with no download code needed. BDN mirrors weights before compute is provisioned, then caches them for faster mounting on subsequent jobs.

See [storage and data ingestion](/training/concepts/storage) for setup details.

### Persistent caching

Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don't re-download 70B models every time.

See the [training cache](/training/concepts/cache) guide for configuration options.
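To illustrate the idea, here is a minimal sketch of a cache-aware fetch: the expensive download runs only on a cache miss. The helper `cached_or_fetch`, the `.complete` marker file, and the cache root you pass in are all assumptions for illustration; the actual cache mount path and configuration are described in the training cache guide.

```python
from pathlib import Path
from typing import Callable


def cached_or_fetch(cache_root: str, key: str, fetch: Callable[[Path], None]) -> Path:
    """Return the cached directory for `key`, calling `fetch` only on a miss."""
    target = Path(cache_root) / key
    marker = target / ".complete"
    if marker.exists():
        return target  # Cache hit: skip the download entirely.
    target.mkdir(parents=True, exist_ok=True)
    fetch(target)  # Caller-supplied download into the target directory.
    # Write the marker last so an interrupted download is retried next time.
    marker.touch()
    return target
```

Writing the marker only after `fetch` returns keeps a partially downloaded model from being mistaken for a complete one on the next job.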

### Multi-node training

Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables. You just set `node_count` in your configuration.

Learn more about [multi-node training](/training/concepts/multinode).
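Inside a training script, distributed settings are typically read from environment variables. The sketch below uses the variable names that `torchrun` and `torch.distributed` conventionally rely on (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`). Whether Baseten sets these exact names is an assumption here; check the multi-node guide for the variables your job actually receives. `DistributedEnv` and `read_distributed_env` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistributedEnv:
    rank: int          # Global index of this process across all nodes.
    local_rank: int    # Index of this process on its own node.
    world_size: int    # Total number of processes in the job.
    master_addr: str
    master_port: int

    @property
    def is_main_process(self) -> bool:
        """Only the rank-0 process should log metrics or save checkpoints."""
        return self.rank == 0


def read_distributed_env(env: Optional[Mapping[str, str]] = None) -> DistributedEnv:
    """Parse torchrun-style variables, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return DistributedEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )
```

Defaulting to a single-process configuration lets the same script run unchanged on one GPU and across several nodes.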

### Remote access

Connect to running training containers to debug, inspect state, and iterate without resubmitting. Baseten offers two options:

* **[SSH](/training/ssh)**: Connect from any OpenSSH client for terminal sessions and file transfer with `scp` or `sftp`.
* **[VS Code & Cursor](/training/interactive-sessions)**: Connect from VS Code or Cursor through Remote Tunnels for a full IDE experience.

See the [remote access overview](/training/remote-access) to choose between them.

## Next steps

<CardGroup cols={3}>
  <Card title="Get started" icon="rocket" href="/training/getting-started">
    Run your first training job and deploy the result.
  </Card>

  <Card title="Loops" icon="link" href="/loops/overview">
    Tinker-compatible managed training with paired trainer and sampling servers.
  </Card>

  <Card title="ML Cookbook" icon="book" href="https://github.com/basetenlabs/ml-cookbook">
    Production-ready examples for frameworks and models.
  </Card>
</CardGroup>

## Reference

* [CLI reference](/reference/cli/training/training-cli)
* [SDK reference](/reference/sdk/training)
* [API reference](/reference/training-api/overview)
