Training jobs require data, whether that’s training datasets, model weights, or configuration files. This guide covers how to get your data into Baseten and the storage options available during training.
Storage types overview
Baseten Training provides three types of storage:
| Storage type | Persistence | Use case |
|---|---|---|
| Training cache | Persistent between jobs | Large model downloads, preprocessed datasets, shared artifacts. |
| Checkpointing | Backed up to cloud storage | Model checkpoints, training artifacts you want to deploy or download. |
| Ephemeral storage | Cleared after job completes | Temporary files, intermediate outputs. |
Ephemeral storage
Ephemeral storage is cleared when your job completes. Use it for:
- Temporary files during training.
- Intermediate outputs that don’t need to persist.
- Scratch space for data processing.
Ephemeral storage is typically limited to a few GB, and its usage is isolated so it cannot affect other containers on the same node.
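For example, a minimal sketch that keeps an intermediate artifact in ephemeral scratch space (the file names here are illustrative):

```python
import json
import tempfile

# Anything written to ephemeral storage is discarded when the job completes.
with tempfile.TemporaryDirectory() as scratch_dir:
    stats_path = f"{scratch_dir}/token_stats.json"
    with open(stats_path, "w") as f:
        json.dump({"num_tokens": 1_000_000}, f)
    # ...use stats_path for the remainder of this preprocessing step...
```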
Getting data into training
The most common pattern is to upload your training data to cloud storage separately, then download it to persistent storage during your job. This approach is faster than bundling large datasets with your job submission.
You can load training data from Amazon S3, Hugging Face, or Google Cloud Storage.
Amazon S3
S3 is the most common method for loading training data. Use Baseten Secrets to authenticate to your S3 bucket.
- Add your AWS credentials as secrets in your Baseten account.
- Reference the secrets in your job configuration:
```python
from truss_train import definitions

runtime = definitions.Runtime(
    environment_variables={
        # Each SecretReference resolves at runtime to a secret stored in your Baseten account.
        "AWS_ACCESS_KEY_ID": definitions.SecretReference(name="aws_access_key_id"),
        "AWS_SECRET_ACCESS_KEY": definitions.SecretReference(name="aws_secret_access_key"),
    },
)
```
- Download from S3 in your training script:

```python
import boto3

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment,
# so no explicit credentials are needed here.
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'training-data.tar.gz', '/path/to/local/file')
```

To avoid re-downloading large datasets on each job, download to the training cache and check whether the files already exist before downloading, as in the sketch below.
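A minimal sketch of that pattern; the `/cache` directory and file names are placeholders, so substitute the cache mount path configured for your job:

```python
import os

import boto3

CACHE_DIR = '/cache'  # placeholder: use your job's configured training cache path
local_path = os.path.join(CACHE_DIR, 'training-data.tar.gz')

# Only hit S3 when the file is not already in the persistent cache.
if not os.path.exists(local_path):
    s3 = boto3.client('s3')
    s3.download_file('my-bucket', 'training-data.tar.gz', local_path)
```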
Hugging Face
Upload your dataset to Hugging Face, then reference it in your training code:

```python
from datasets import load_dataset

ds = load_dataset("your-username/your-dataset", split="train")
```
For private datasets, authenticate using a Hugging Face token stored in Baseten Secrets:

```python
runtime = definitions.Runtime(
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
```
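Both `datasets` and `huggingface_hub` read `HF_TOKEN` from the environment, so the `load_dataset` call above typically works for private datasets as-is. As a sketch, recent versions of `datasets` also accept the token explicitly (the dataset ID is a placeholder):

```python
import os

from datasets import load_dataset

# HF_TOKEN is injected by the Runtime configuration above.
ds = load_dataset(
    "your-username/your-private-dataset",  # placeholder dataset ID
    split="train",
    token=os.environ["HF_TOKEN"],
)
```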
Google Cloud Storage
Authenticate via Baseten Secrets and download in your training code:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('training-data.tar.gz')
blob.download_to_filename('/path/to/local/file')
```
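`storage.Client()` relies on Application Default Credentials. One way to wire a service-account key stored as a Baseten secret into the client is sketched below; the secret and environment variable name `GCP_SERVICE_ACCOUNT_JSON` is an assumption, not an established convention:

```python
import json
import os

from google.cloud import storage
from google.oauth2 import service_account

# Assumption: the full service-account JSON key is stored as a Baseten secret
# and exposed to the job as the GCP_SERVICE_ACCOUNT_JSON environment variable.
info = json.loads(os.environ["GCP_SERVICE_ACCOUNT_JSON"])
credentials = service_account.Credentials.from_service_account_info(info)
client = storage.Client(project=info["project_id"], credentials=credentials)

bucket = client.bucket('my-bucket')
bucket.blob('training-data.tar.gz').download_to_filename('/path/to/local/file')
```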
Data size and limits
Baseten Training supports datasets of various sizes:
| Size | Description |
|---|---|
| Small | A few GBs. |
| Medium | Up to 1 TB (most common). |
| Large | 1-10 TB. |
The default training cache is 1 TB. Contact support to increase the cache size for larger datasets.
Data security
Baseten handles all data securely:
- Data transfer happens within Baseten’s VPC using secure connections.
- Baseten does not share customer data across tenants.
- Baseten maintains a zero data retention policy.
- For self-hosted deployments, training can use storage buckets in your own AWS or GCP account.
Storage performance
Read and write speeds vary by cluster and storage configuration:
| Storage type | Write speed | Read speed |
|---|---|---|
| Node storage | 1.2-1.8 GB/s | 1.7-2.1 GB/s |
| Training cache | 340 MB/s - 1.0 GB/s | 470 MB/s - 1.6 GB/s |
For workloads with very high I/O or storage requirements, contact support.
Next steps
- Cache: Learn how to persist data between jobs and speed up training iterations.
- Checkpointing: Save and manage model checkpoints during training.
- Multinode training: Scale training across multiple nodes with shared cache access.