Training jobs require data, whether that’s training datasets, model weights, or configuration files. This guide covers how to get your data into Baseten and the storage options available during training.
Storage types overview
Baseten Training provides three types of storage:
| Storage type | Persistence | Use case |
|---|---|---|
| Training cache | Persistent between jobs | Large model downloads, preprocessed datasets, shared artifacts. |
| Checkpointing | Backed up to cloud storage | Model checkpoints, training artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |
Ephemeral storage
Ephemeral storage is cleared when your job completes. Use it for:
- Temporary files during training.
- Intermediate outputs that don’t need to persist.
- Scratch space for data processing.
Ephemeral storage is typically limited to a few GB, and its usage is isolated so it cannot affect other containers on the same node.
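For example, a minimal sketch that keeps an intermediate artifact in ephemeral scratch space (the file names here are illustrative):

```python
import json
import tempfile

# Anything written to ephemeral storage is discarded when the job completes.
with tempfile.TemporaryDirectory() as scratch_dir:
    stats_path = f"{scratch_dir}/token_stats.json"
    with open(stats_path, "w") as f:
        json.dump({"num_tokens": 1_000_000}, f)
    # ...use stats_path for the remainder of this preprocessing step...
```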
Getting data into training
The most common pattern is to upload your training data to cloud storage separately, then download it to persistent storage during your job. This approach is faster than bundling large datasets with your job submission.
You can load training data from Amazon S3, Hugging Face, or Google Cloud Storage.
Amazon S3
S3 is the most common method for loading training data. Use Baseten Secrets to authenticate to your S3 bucket.
- Add your AWS credentials as secrets in your Baseten account.
- Reference the secrets in your job configuration:
```python
from truss_train import definitions

runtime = definitions.Runtime(
    environment_variables={
        # Each SecretReference resolves at runtime to a secret stored in your Baseten account.
        "AWS_ACCESS_KEY_ID": definitions.SecretReference(name="aws_access_key_id"),
        "AWS_SECRET_ACCESS_KEY": definitions.SecretReference(name="aws_secret_access_key"),
    },
)
```
- Download from S3 in your training script:

```python
import boto3

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment,
# so no explicit credentials are needed here.
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'training-data.tar.gz', '/path/to/local/file')
```

To avoid re-downloading large datasets on each job, download to the training cache and check whether the files already exist before downloading, as in the sketch below.
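A minimal sketch of that pattern; the `/cache` directory and file names are placeholders, so substitute the cache mount path configured for your job:

```python
import os

import boto3

CACHE_DIR = '/cache'  # placeholder: use your job's configured training cache path
local_path = os.path.join(CACHE_DIR, 'training-data.tar.gz')

# Only hit S3 when the file is not already in the persistent cache.
if not os.path.exists(local_path):
    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'training-data.tar.gz', local_path)
```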
Hugging Face
Upload your dataset to Hugging Face, then reference it in your training code:

```python
from datasets import load_dataset

ds = load_dataset("your-username/your-dataset", split="train")
```
For private datasets, authenticate using a Hugging Face token stored in Baseten Secrets:

```python
runtime = definitions.Runtime(
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
```
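Both `datasets` and `huggingface_hub` read `HF_TOKEN` from the environment, so the `load_dataset` call above typically works for private datasets as-is. As a sketch, recent versions of `datasets` also accept the token explicitly (the dataset ID is a placeholder):

```python
import os

from datasets import load_dataset

# HF_TOKEN is injected by the Runtime configuration above.
ds = load_dataset(
    "your-username/your-private-dataset",  # placeholder dataset ID
    split="train",
    token=os.environ["HF_TOKEN"],
)
```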
Google Cloud Storage
Authenticate via Baseten Secrets and download in your training code:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('training-data.tar.gz')
blob.download_to_filename('/path/to/local/file')
```
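`storage.Client()` relies on Application Default Credentials. One way to wire a service-account key stored as a Baseten secret into the client is sketched below; the secret and environment variable name `GCP_SERVICE_ACCOUNT_JSON` is an assumption, not an established convention:

```python
import json
import os

from google.cloud import storage
from google.oauth2 import service_account

# Assumption: the full service-account JSON key is stored as a Baseten secret
# and exposed to the job as the GCP_SERVICE_ACCOUNT_JSON environment variable.
info = json.loads(os.environ["GCP_SERVICE_ACCOUNT_JSON"])
credentials = service_account.Credentials.from_service_account_info(info)
client = storage.Client(project=info["project_id"], credentials=credentials)

bucket = client.bucket('my-bucket')
bucket.blob('training-data.tar.gz').download_to_filename('/path/to/local/file')
```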
Data size and limits
Baseten Training supports datasets of various sizes:
| Size | Description |
|---|---|
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
The default training cache is 1 TB. Contact support to increase the cache size for larger datasets.
Data security
Baseten handles all data securely:
- Data transfer happens within Baseten’s VPC using secure connections.
- Baseten does not share customer data across tenants.
- Baseten maintains a zero data retention policy.
- For self-hosted deployments, training can use storage buckets in your own AWS or GCP account.
Storage performance
Read and write speeds vary by cluster and storage configuration:
| Storage type | Write speed | Read speed |
|---|---|---|
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
For workloads with very high I/O or storage requirements, contact support.
Next steps
- Cache: Learn how to persist data between jobs and speed up training iterations.
- Checkpointing: Save and manage model checkpoints during training.
- Multinode training: Scale training across multiple nodes with shared cache access.