Training jobs require data, whether that’s training datasets, model weights, or configuration files. This guide covers how to get your data into Baseten and the storage options available during training.

Storage types overview

Baseten Training provides three types of storage:
| Storage type | Persistence | Use case |
| --- | --- | --- |
| Training cache | Persistent between jobs | Large model downloads, preprocessed datasets, shared artifacts. |
| Checkpointing | Backed up to cloud storage | Model checkpoints, training artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |

Ephemeral storage

Ephemeral storage is cleared when your job completes. Use it for:
  • Temporary files during training.
  • Intermediate outputs that don’t need to persist.
  • Scratch space for data processing.
Ephemeral storage is typically limited to a few GBs, so one job's usage cannot affect other containers on the same node.
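If you need scratch space for data processing, a temporary directory keeps intermediate files tidy and makes cleanup automatic. A minimal sketch (the /tmp path is an assumption; use whatever scratch location your container image provides):

import os
import tempfile

# Create a scratch directory on ephemeral storage. It is removed when the
# context exits, and ephemeral storage is cleared when the job completes anyway.
with tempfile.TemporaryDirectory(dir="/tmp") as scratch_dir:  # /tmp is an assumption
    shard_path = os.path.join(scratch_dir, "shard-000.jsonl")
    with open(shard_path, "w") as f:
        f.write('{"text": "example record"}\n')
    # ... read the intermediate file back and process it here ...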

Getting data into training

The most common pattern is to upload your training data to cloud storage separately, then download it to persistent storage during your job. This approach is faster than bundling large datasets with your job submission.
S3 is the most common method for loading training data. Use Baseten Secrets to authenticate to your S3 bucket.
  1. Add your AWS credentials as secrets in your Baseten account.
  2. Reference the secrets in your job configuration:
from truss_train import definitions

# Expose the stored secrets to the training container as environment
# variables; boto3 reads AWS credentials from these automatically.
runtime = definitions.Runtime(
    environment_variables={
        "AWS_ACCESS_KEY_ID": definitions.SecretReference(name="aws_access_key_id"),
        "AWS_SECRET_ACCESS_KEY": definitions.SecretReference(name="aws_secret_access_key"),
    },
)
  3. Download from S3 in your training script:
import boto3

# The client picks up the AWS credentials injected above from the environment.
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'training-data.tar.gz', '/path/to/local/file')
To avoid re-downloading large datasets on each job, download to the training cache and check if files exist before downloading.
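A minimal sketch of that pattern, assuming the cache is mounted at a known path (BT_RW_CACHE_DIR and the /cache fallback below are placeholders; substitute the mount path or environment variable your job actually exposes):

import os
import boto3

# Placeholder cache location; substitute the actual cache mount path for your job.
cache_dir = os.environ.get("BT_RW_CACHE_DIR", "/cache")
local_path = os.path.join(cache_dir, "training-data.tar.gz")

# Only download when the file is not already present from a previous job.
if not os.path.exists(local_path):
    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'training-data.tar.gz', local_path)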

Data size and limits

Baseten Training supports datasets of various sizes:
| Size | Description |
| --- | --- |
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
The default training cache is 1 TB. Contact support to increase the cache size for larger datasets.
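Before pulling a very large dataset, it can be worth checking how much space is free on the cache. A small sketch (the /cache path is again a placeholder for your job's cache mount):

import shutil

# Placeholder cache path; use the cache mount path for your job.
usage = shutil.disk_usage("/cache")
print(f"Cache free space: {usage.free / 1e9:.0f} GB")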

Data security

Baseten handles all data securely:
  • Data transfer happens within Baseten’s VPC using secure connections.
  • Baseten does not share customer data across tenants.
  • Baseten maintains a zero data retention policy.
  • For self-hosted deployments, training can use storage buckets in your own AWS or GCP account.

Storage performance

Read and write speeds vary by cluster and storage configuration:
| Storage type | Write speed | Read speed |
| --- | --- | --- |
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
For workloads with very high I/O requirements or large storage requirements, contact support.
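Actual throughput depends on file sizes and access patterns, so it can help to time a representative write from your own job. A rough sketch (both paths are placeholders; point them at node storage and the cache mount for your job):

import os
import time

def write_throughput_gb_per_s(path, size_mb=1024):
    # Write size_mb MiB to path and return approximate throughput in GB/s.
    chunk = os.urandom(1024 * 1024)  # 1 MiB of random bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return size_mb * 1024 * 1024 / 1e9 / elapsed

print("node storage:", write_throughput_gb_per_s("/tmp/throughput-test.bin"))
print("training cache:", write_throughput_gb_per_s("/cache/throughput-test.bin"))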

Next steps

  • Cache: Learn how to persist data between jobs and speed up training iterations.
  • Checkpointing: Save and manage model checkpoints during training.
  • Multinode training: Scale training across multiple nodes with shared cache access.