Training jobs need model weights, training datasets, and configuration files. Baseten provides multiple ways to get data into your training container, from cached delivery through the Baseten Delivery Network (BDN) to direct downloads in your training script.

Load weights and data with BDN

Use the weights parameter on TrainingJob to mount model weights and training data into your container through BDN. BDN mirrors your data once and serves it from multi-tier caches, so subsequent jobs start faster.
BDN mirrors your weights to Baseten storage during the CREATED state, before any compute is provisioned. This mirroring step is not billed. Once the job enters DEPLOYING, compute billing begins. This includes the time BDN takes to mount cached weights into your container. Cached weights mount faster than first-time downloads, reducing billable deploy time on subsequent jobs.
Each weight source specifies a remote URI and a local mount path. When your container starts, the data is already available at the mount_location. No download code needed in your training script.

Hugging Face and S3 example

Load model weights from Hugging Face and training data from S3, mounted into the training container before your code runs:
config.py
from truss_train import TrainingProject, TrainingJob, Image, Compute, Runtime, WeightsSource
from truss.base.truss_config import AcceleratorSpec

training_job = TrainingJob(
    image=Image(base_image="pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime"),
    compute=Compute(
        accelerator=AcceleratorSpec(accelerator="H100", count=1),
    ),
    runtime=Runtime(
        start_commands=["python train.py"],
    ),
    weights=[
        WeightsSource(
            source="hf://Qwen/Qwen3-0.6B",
            mount_location="/app/models/Qwen/Qwen3-0.6B",
        ),
        WeightsSource(
            source="s3://my-bucket/training-data",
            mount_location="/app/data/training-data",
        ),
    ],
)

training_project = TrainingProject(name="qwen3-finetune", job=training_job)
In your training script, reference the mount paths directly:
train.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/app/models/Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("/app/models/Qwen/Qwen3-0.6B")

# Training data is available at /app/data/training-data/

Supported sources

BDN supports these URI schemes:
| Scheme | Example | Description |
| --- | --- | --- |
| hf:// | hf://meta-llama/Llama-3.1-8B@main | Hugging Face Hub. |
| s3:// | s3://my-bucket/path/to/data | Amazon S3. |
| gs:// | gs://my-bucket/path/to/data | Google Cloud Storage. |
| azure:// | azure://account/container/path | Azure Blob Storage. |
| r2:// | r2://account_id.bucket/path | Cloudflare R2. |
| https:// | https://example.com/model.bin | Direct URL download. |
For Hugging Face sources, pin to a specific revision with the @revision suffix (branch, tag, or commit SHA).
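As an illustration, the same repository can be pinned three ways (the tag and commit SHA below are hypothetical values, not real revisions of this repo):

```python
# Hypothetical revisions for illustration; any branch, tag, or commit SHA works.
by_branch = "hf://meta-llama/Llama-3.1-8B@main"     # branch name
by_tag    = "hf://meta-llama/Llama-3.1-8B@v1.0"     # tag
by_commit = "hf://meta-llama/Llama-3.1-8B@31fe882"  # commit SHA (short form)
```

Pinning to a commit SHA makes a job reproducible even if the branch later moves.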

Authentication

Private or gated sources require authentication. Store your access token as a Baseten secret, then reference it from an auth block on the WeightsSource. For example, with a Hugging Face token stored as the secret hf_access_token:
WeightsSource(
    source="hf://meta-llama/Llama-3.1-8B@main",
    mount_location="/app/models/llama",
    auth={"auth_method": "CUSTOM_SECRET", "auth_secret_name": "hf_access_token"},
)
For the full list of authentication options and source-specific configuration, see the BDN configuration reference.

Filtering files

Use allow_patterns and ignore_patterns to download only the files you need:
WeightsSource(
    source="hf://meta-llama/Llama-3.1-8B@main",
    mount_location="/app/models/llama",
    allow_patterns=["*.safetensors", "config.json", "tokenizer.*"],
    ignore_patterns=["*.md", "*.txt"],
)

Storage types overview

Baseten Training provides three types of storage:
| Storage type | Persistence | Use case |
| --- | --- | --- |
| Training cache | Persistent between jobs | Large model downloads, preprocessed datasets, shared artifacts. |
| Checkpointing | Backed up to cloud storage | Model checkpoints, training artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |

Ephemeral storage

Ephemeral storage is cleared when your job completes. Use it for:
  • Temporary files during training.
  • Intermediate outputs that don’t need to persist.
  • Scratch space for data processing.
Ephemeral storage is typically limited to a few GBs and is isolated per container, so heavy usage cannot affect other containers on the same node.
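A minimal sketch of using ephemeral storage as scratch space, assuming the container's default temp directory lives on ephemeral disk:

```python
import tempfile
from pathlib import Path

# Scratch space for intermediate outputs; everything here is discarded
# when the job completes, so only write results you can afford to lose.
with tempfile.TemporaryDirectory(prefix="preprocess-") as scratch:
    shard = Path(scratch) / "shard-000.bin"
    shard.write_bytes(b"intermediate output")
    # ...process the shard, then write final artifacts to the cache
    # or checkpoint directory so they survive the job...
```

The directory and its contents are removed automatically when the with block exits.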

Loading data in your training script

When data isn’t available through a BDN-supported URI scheme, download it directly in your training script. This works well for datasets loaded from framework-specific libraries or custom download logic.
Use Baseten secrets to authenticate to your S3 bucket.
  1. Add your AWS credentials as secrets in your Baseten account.
  2. Reference the secrets in your job configuration:
    from truss_train import definitions
    
    runtime = definitions.Runtime(
        environment_variables={
            "AWS_ACCESS_KEY_ID": definitions.SecretReference(name="aws_access_key_id"),
            "AWS_SECRET_ACCESS_KEY": definitions.SecretReference(name="aws_secret_access_key"),
        },
    )
    
  3. Download from S3 in your training script:
    import boto3
    
    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'training-data.tar.gz', '/path/to/local/file')
    
To avoid re-downloading large datasets on each job, download to the training cache and check if files exist before downloading.
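A sketch of that pattern: the helper name is illustrative, and the cache mount path passed in is an assumption (see the cache documentation for the actual mount location on your cluster):

```python
from pathlib import Path

def ensure_dataset(cache_dir: Path, key: str, download) -> Path:
    """Download `key` into the persistent training cache only if it is absent.

    `download` is any callable that writes the file to the given path, e.g.
    lambda p: s3.download_file("my-bucket", key, str(p)).
    """
    target = cache_dir / key
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)  # first job pays the download; later jobs hit the cache
    return target
```

On the first job the download runs once; subsequent jobs find the file already present in the cache and skip it entirely.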

Data size and limits

| Size | Description |
| --- | --- |
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
The default training cache is 1 TB. Contact support to increase the cache size for larger datasets.

Data security

Data transfer happens within Baseten’s VPC using secure connections. Baseten doesn’t share customer data across tenants and maintains a zero data retention policy. For self-hosted deployments, training can use storage buckets in your own AWS or GCP account.

Storage performance

Read and write speeds vary by cluster and storage configuration:
| Storage type | Write speed | Read speed |
| --- | --- | --- |
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
For workloads with high I/O or large storage requirements, contact support.

Next steps

  • BDN configuration reference: Full list of weight source options, authentication methods, and supported URI schemes.
  • Cache: Persist data between jobs and speed up training iterations.
  • Checkpointing: Save and manage model checkpoints during training.
  • Multinode training: Scale training across multiple nodes with shared cache access.