Training jobs need model weights, training datasets, and configuration files. Baseten provides multiple ways to get data into your training container, from cached delivery through the Baseten Delivery Network (BDN) to direct downloads in your training script.
Load weights and data with BDN
Use the weights parameter on TrainingJob to mount model weights and training data into your container through BDN. BDN mirrors your data once and serves it from multi-tier caches, so subsequent jobs start faster.
BDN mirrors your weights to Baseten storage during the CREATED state, before any compute is provisioned. Once your job is scheduled on a node, BDN places the weights on local disk before your start_commands run. Weight delivery never overlaps with workload execution, so BDN has no effect on training throughput. The only difference between a cache hit and a cache miss is how long the deploy phase takes. Inside the container, your script reads files from the path you set in mount_location; no download code is needed in your training script.
Hugging Face and S3 example
Load model weights from Hugging Face and training data from S3, mounted into the training container before your code runs.
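As a rough sketch, a job configuration for this setup might look like the following. The import path and constructor signatures are assumptions based on the names used on this page (TrainingJob, weights, WeightsSource, mount_location, start_commands); check the BDN configuration reference for the exact API.

```python
# Hypothetical sketch -- import path and field names are assumptions
# based on this page's terminology, not the verified SDK API.
from truss_train import TrainingJob, WeightsSource  # assumed import path

job = TrainingJob(
    start_commands=["python train.py"],
    weights=[
        # Model weights from Hugging Face Hub, pinned to the main branch.
        WeightsSource(
            source="hf://meta-llama/Llama-3.1-8B@main",
            mount_location="/weights/llama-3.1-8b",
        ),
        # Training data from S3.
        WeightsSource(
            source="s3://my-bucket/datasets/train",
            mount_location="/data/train",
        ),
    ],
)
```

Inside the container, train.py then reads directly from /weights/llama-3.1-8b and /data/train, with no download logic of its own.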
Supported sources
BDN supports these URI schemes:

| Scheme | Example | Description |
|---|---|---|
| hf:// | hf://meta-llama/Llama-3.1-8B@main | Hugging Face Hub. Accepts an optional @revision suffix (branch, tag, or commit SHA). |
| s3:// | s3://my-bucket/path/to/data | Amazon S3. |
| gs:// | gs://my-bucket/path/to/data | Google Cloud Storage. |
| r2:// | r2://account_id.bucket/path | Cloudflare R2. |
Authentication
Private or gated sources require authentication. Add an auth block to your WeightsSource:

- Hugging Face: store a Hugging Face access token as a Baseten secret, then reference the secret in the auth block.
- S3 (IAM credentials): store your AWS access key ID and secret access key as Baseten secrets.
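A sketch of what the Hugging Face case might look like; the auth field shape and secret-reference syntax are assumptions based on this page's terminology, so verify against the BDN configuration reference.

```python
# Hypothetical sketch -- the auth schema is an assumption, not the
# verified SDK API. "hf_access_token" is the name of a Baseten secret
# you have already created in your account.
WeightsSource(
    source="hf://meta-llama/Llama-3.1-8B",
    mount_location="/weights/llama-3.1-8b",
    auth={"secret_name": "hf_access_token"},
)
```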
Filtering files
Use allow_patterns and ignore_patterns to download only the files you need.
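For example, a sketch that pulls only the safetensors weights and tokenizer files while skipping duplicate .bin weights (field names taken from this page; exact glob semantics are an assumption to confirm in the BDN reference):

```python
# Hypothetical sketch -- allow_patterns/ignore_patterns are named on this
# page, but the exact pattern semantics are assumed to be shell-style globs.
WeightsSource(
    source="hf://meta-llama/Llama-3.1-8B",
    mount_location="/weights/llama-3.1-8b",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    ignore_patterns=["*.bin"],
)
```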
How BDN serves training jobs
When you submit a training job, BDN compares your weights config to what’s already in Baseten storage, pulls anything missing from the upstream source, and stages the full set on the node before your start_commands run.
Data delivery happens entirely during the CREATED and DEPLOYING phases.
Two cache tiers sit in front of Baseten’s mirror:
- Cluster-local cache: shared across nodes in a GPU cluster. Populated the first time a job in that cluster pulls a given set of files.
- Node-local cache: lives on the node itself. Populated when a job lands on that node.
BDN or training cache?
Use BDN for read-only inputs that are known at job start, like model weights and frozen datasets. Baseten delivers them before training begins, so you never pay for I/O or compute time while they load. Use the training cache when you need read-write storage that persists across jobs, or when one job produces data that a later job consumes. Common examples: pip package installs, compiled artifacts, and preprocessed datasets you build once and reuse.

Storage types overview
Baseten Training provides four ways to move data in and out of a job:

| Storage type | Persistence | Use case |
|---|---|---|
| BDN (weights) | Mirrored once; cluster- and node-local LRU caches | Read-only model weights and datasets known at job start. |
| Training cache | Read-write, persistent between jobs | Pip packages, compiled artifacts, preprocessed datasets. |
| Checkpointing | Backed up to cloud storage | Model checkpoints and artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |
Ephemeral storage
Write temporary files to the $BT_SCRATCH_DIR directory. This path is backed by local NVMe storage on the node and is cleared when your job completes. Use it for:
- Temporary files during training.
- Intermediate outputs that don’t need to persist.
- Scratch space for data processing.
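In a training script, scratch usage can look like this small sketch. The /tmp fallback is an assumption for running the same code outside a Baseten container, where $BT_SCRATCH_DIR is not set.

```python
import os
from pathlib import Path

# $BT_SCRATCH_DIR is set inside the training container; falling back to
# /tmp for local runs is an assumption, not Baseten-documented behavior.
SCRATCH = Path(os.environ.get("BT_SCRATCH_DIR", "/tmp"))

def write_intermediate(name: str, data: bytes) -> Path:
    """Write an intermediate artifact to node-local scratch space.

    Anything written here is cleared when the job completes, so only
    use it for files you don't need after the job ends.
    """
    path = SCRATCH / name
    path.write_bytes(data)
    return path
```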
Loading data in your training script
When data isn’t available through a BDN-supported URI scheme, download it directly in your training script. This works well for datasets loaded through framework-specific libraries or custom download logic.

- Amazon S3
- Hugging Face
- Google Cloud Storage
Use Baseten secrets to authenticate to your S3 bucket:

1. Add your AWS credentials as secrets in your Baseten account.
2. Reference the secrets in your job configuration.
3. Download from S3 in your training script.
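The download step can be sketched as below. It assumes boto3 is installed in the training image and that your Baseten secrets are exposed as the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), which boto3 picks up automatically; the function name is illustrative.

```python
def download_dataset(bucket: str, key: str, dest: str, s3=None) -> str:
    """Download one object from S3 to local disk.

    By default this builds a boto3 client, which reads credentials from
    the standard AWS environment variables -- the assumption here is that
    your Baseten secrets are wired to those variable names in the job
    configuration. Passing s3 explicitly allows injecting a preconfigured
    or stub client.
    """
    if s3 is None:
        import boto3  # assumes boto3 is installed in the training image
        s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest)
    return dest
```

Call it early in your training script, before model setup, so a missing object or bad credentials fail fast.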
Data size and limits
| Size | Description |
|---|---|
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
Data security
Data transfer happens within Baseten’s VPC using secure connections. Baseten doesn’t share customer data across tenants. When you enable training cache, data persists between jobs until you delete the project. Ephemeral storage is cleared when your job completes. For self-hosted deployments, training can use storage buckets in your own AWS or GCP account. To learn more and access official policies and certifications, visit the Baseten Trust Center.

Storage performance
Read and write speeds vary by cluster and storage configuration:

| Storage type | Write speed | Read speed |
|---|---|---|
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
Next steps
- BDN configuration reference: Full list of weight source options, authentication methods, and supported URI schemes.
- Cache: Persist data between jobs and speed up training iterations.
- Checkpointing: Save and manage model checkpoints during training.
- Multinode training: Scale training across multiple nodes with shared cache access.