Baseten Delivery Network (BDN) reduces cold start times by mirroring your model weights to Baseten’s infrastructure and caching them close to your pods. Instead of downloading hundreds of gigabytes from Hugging Face, S3, or GCS on every scale-up, BDN mirrors weights once and serves them from multi-tier caches. Configure BDN using the weights key in your config. This works with both Model class deployments and custom Docker images.

Quick start

Add a weights section to your config.yaml:
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    allow_patterns: ["*.safetensors", "config.json"]
    ignore_patterns: ["*.md", "*.txt"]
Field | Description
source | Where to fetch weights from. Supports Hugging Face, S3, GCS, and R2.
mount_location | Absolute path where weights appear in your container.
allow_patterns | Optional. Only download files matching these patterns. Useful for skipping large files you don’t need. See filtering files.
ignore_patterns | Optional. Exclude files matching these patterns. Useful for skipping documentation or unused formats.
For private or gated models, add auth_secret_name to reference a Baseten secret with your credentials.
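For example, assuming you’ve already created a secret named hf_access_token in your Baseten workspace:
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    auth_secret_name: "hf_access_token"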

Accessing weights in your model

When your model starts, weights are already downloaded and available at your mount_location. The directory structure from the source is preserved:
/models/llama/                           # Your mount_location
├── config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── ...
├── model.safetensors.index.json
├── tokenizer.json
├── tokenizer_config.json
└── original/                            # Subfolders are preserved
    ├── consolidated.00.pth
    └── params.json
Load weights directly from this path in your load() method. No download code needed:
model.py
from transformers import AutoModelForCausalLM
import torch

class Model:
    def load(self):
        # Weights are already available at mount_location
        self._model = AutoModelForCausalLM.from_pretrained(
            "/models/llama",
            torch_dtype=torch.float16,
            device_map="auto"
        )
The mount is read-only. Weights are fetched during truss push and cached, so cold starts only read from local or nearby caches.
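Because the files are guaranteed to be present before load() runs, you can fail fast if anything looks wrong. A minimal sketch; the mount path and expected index file are assumptions matching the example above:
import os

class Model:
    def load(self):
        mount = "/models/llama"  # your mount_location
        # Fail fast with a clear error if the expected files are missing.
        index = os.path.join(mount, "model.safetensors.index.json")
        if not os.path.isfile(index):
            raise RuntimeError(
                f"Expected weights at {mount}, found: {os.listdir(mount)}"
            )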

Configuration reference

weights

A list of weight sources to mount into your model container.
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    auth_secret_name: "hf_access_token"
    allow_patterns: ["*.safetensors", "config.json"]
    ignore_patterns: ["*.md", "*.txt"]
source
string
required
URI specifying where to fetch weights from. Supported schemes:
  • hf://: Hugging Face Hub.
  • s3://: AWS S3.
  • gs://: Google Cloud Storage.
  • r2://: Cloudflare R2.
For Hugging Face sources, specify a revision using @revision suffix (branch, tag, or commit SHA).
mount_location
string
required
Absolute path where weights will be mounted in your container. Must start with /.
mount_location: "/models/llama"  # Correct
mount_location: "models/llama"   # Wrong - not absolute
auth_secret_name
string
Name of a Baseten secret containing credentials for accessing private weight sources. See Source types and authentication for the expected secret format for each source type.
allow_patterns
string[]
File patterns to include. Uses Unix shell-style wildcards. Only matching files will be downloaded.
allow_patterns:
  - "*.safetensors"
  - "config.json"
  - "tokenizer.*"
ignore_patterns
string[]
File patterns to exclude. Uses Unix shell-style wildcards. Matching files will be skipped.
ignore_patterns:
  - "*.md"
  - "*.txt"
  - "*.bin"  # Skip PyTorch .bin files if using safetensors

Source types and authentication

For private weight sources, create a Baseten secret with the appropriate credentials. Manage secrets in your Baseten settings.

Hugging Face

Download weights from Hugging Face Hub repositories.
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    auth_secret_name: "hf_access_token"  # Required for private/gated repos
    allow_patterns: ["*.safetensors", "config.json"]
Format: hf://owner/repo@revision
  • owner/repo: The Hugging Face repository.
  • @revision: Branch, tag, or commit SHA.
Revision pinning: When you use a branch name like @main, Baseten resolves it to the specific commit SHA at deploy time and mirrors those exact files. Your deployment stays pinned to that version. Subsequent scale-ups won’t pick up new commits. To update to newer weights, push a new deployment.
Authentication: Hugging Face API token (plain text)
Secret Name | Secret Value
hf_access_token | hf_xxxxxxxxxxxxxxxx...
Get your token from Hugging Face settings.

AWS S3

Download weights from an S3 bucket.
config.yaml
weights:
  - source: "s3://my-bucket/models/custom-weights"
    mount_location: "/models/custom"
    auth_secret_name: "aws_credentials"
Format: s3://bucket/path
Authentication: JSON with AWS credentials
Secret Name | Secret Value
aws_credentials | {"access_key_id": "AKIA...", "secret_access_key": "...", "region": "us-west-2"}
The region field is required. Optionally include session_token for temporary credentials.
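A small sketch for assembling a correctly shaped secret value (all credential values here are placeholders):
import json

# Placeholder credentials; substitute your own before creating the secret.
secret_value = json.dumps({
    "access_key_id": "AKIA...",
    "secret_access_key": "...",
    "region": "us-west-2",       # required
    # "session_token": "...",    # optional, for temporary credentials
})
print(secret_value)  # paste this as the Baseten secret value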

Google Cloud Storage

Download weights from a GCS bucket.
config.yaml
weights:
  - source: "gs://my-bucket/models/weights"
    mount_location: "/models/gcs-weights"
    auth_secret_name: "gcp_service_account"
Format: gs://bucket/path
Authentication: GCP service account JSON key
Secret Name | Secret Value
gcp_service_account | {"type": "service_account", "project_id": "...", ...}
Download the key from the GCP Console under IAM & Admin > Service Accounts.

Cloudflare R2

Download weights from a Cloudflare R2 bucket.
config.yaml
weights:
  - source: "r2://abc123def.my-bucket/models/weights"
    mount_location: "/models/r2-weights"
    auth_secret_name: "r2_credentials"
Format: r2://account_id.bucket/path
  • account_id: Your Cloudflare account ID.
  • bucket: R2 bucket name, separated from account_id by a period.
  • path: Path prefix within the bucket.
Authentication: JSON with R2 API credentials
Secret Name | Secret Value
r2_credentials | {"access_key_id": "...", "secret_access_key": "..."}
Get your R2 API tokens from the Cloudflare dashboard under R2 > Manage R2 API Tokens.

Migration from model_cache

model_cache is deprecated. Migrate to weights for faster cold starts through multi-tier caching.

Automated migration with truss migrate

The truss migrate CLI command automatically converts model_cache configurations:
# Run in your Truss directory
truss migrate

# Or specify a directory
truss migrate /path/to/truss
The command will:
  1. Show a colorized diff of the proposed changes.
  2. Prompt for confirmation before applying.
  3. Create a backup of your original config.yaml.
  4. Warn about any model.py path changes needed.

Manual migration reference

From model_cache to weights:
model_cache | weights
repo_id: "owner/repo" | source: "hf://owner/repo@rev"
revision: "main" | Included in source URI as @main
kind: "s3" | Prefix: s3://bucket/path
kind: "gcs" | Prefix: gs://bucket/path
kind: "r2" | Prefix: r2://account_id.bucket/path
volume_folder: "name" | mount_location: "/app/model_cache/name"
runtime_secret_name | auth_secret_name
allow_patterns | allow_patterns (same)
ignore_patterns | ignore_patterns (same)
Example migration:
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/app/model_cache/llama"
    allow_patterns:
      - "*.safetensors"
      - "config.json"
    auth_secret_name: hf_access_token
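For comparison, the legacy model_cache entry this replaces would look roughly like the following (reconstructed from the mapping table above; your exact legacy fields may differ):
config.yaml
model_cache:
  - repo_id: "meta-llama/Llama-3.1-8B"
    revision: "main"
    use_volume: true
    volume_folder: "llama"
    runtime_secret_name: "hf_access_token"
    allow_patterns:
      - "*.safetensors"
      - "config.json"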

Chains migration

For Truss Chains, update Assets.cached to Assets.weights in your Python code:
import truss_chains as chains
from truss.base import truss_config

class MyChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        assets=chains.Assets(
            weights=[
                truss_config.WeightsSource(
                    source="hf://meta-llama/Llama-3.1-8B@main",
                    mount_location="/app/model_cache/llama",
                    auth_secret_name="hf_access_token",
                    allow_patterns=["*.safetensors", "config.json"],
                )
            ],
            secret_keys=["hf_access_token"],
        ),
    )
Key changes:
  • ModelRepo → WeightsSource.
  • repo_id + revision → source URI with @revision suffix.
  • volume_folder → mount_location (must be an absolute path).
  • runtime_secret_name → auth_secret_name.
  • Remove use_volume and kind (inferred from URI scheme).

Custom server migration

If you’re using a custom server with model_cache, you’ll need to make additional changes when migrating to weights:
  1. Remove truss-transfer-cli from your start_command. With weights, files are pre-mounted before your container starts.
  2. Update file paths from /app/model_cache/{volume_folder} to your new mount_location.
config.yaml
docker_server:
  # No truss-transfer-cli needed - weights are pre-mounted
  start_command: text-embeddings-router --port 7997 --model-id /models/jina --max-client-batch-size 128
weights:
  - source: "hf://jinaai/jina-embeddings-v2-base-code@516f4baf..."
    mount_location: "/models/jina"
    ignore_patterns: ["*.onnx"]

Best practices

Pin to specific commits

Avoid using branch names like @main in production. While Baseten pins to the commit SHA at deploy time, using @main means each new deployment may get different weights, making debugging and rollbacks difficult.
Always pin to a specific commit SHA for reproducible deployments:
# Recommended - reproducible across deploys
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@5206a32e7b8a9f1c..."
    mount_location: "/models/llama"

# Not recommended for production - each new deployment resolves to a different commit
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
To find the current commit SHA for a Hugging Face repo:
# Using the Hugging Face CLI
huggingface-cli repo-info meta-llama/Llama-3.1-8B --revision main
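
# Or resolve it programmatically with the huggingface_hub Python client
# (this uses huggingface_hub's API, not Truss):
from huggingface_hub import HfApi

# Resolve the commit SHA that a branch currently points to.
info = HfApi().model_info("meta-llama/Llama-3.1-8B", revision="main")
print(info.sha)  # use in your source URI: hf://meta-llama/Llama-3.1-8B@<sha>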

Filter files with patterns

Only download what you need to minimize cold start time:
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    allow_patterns:
      - "*.safetensors"    # Model weights
      - "config.json"      # Model config
      - "tokenizer.*"      # Tokenizer files
    ignore_patterns:
      - "*.bin"            # Skip PyTorch format if using safetensors
      - "*.md"             # Skip documentation
      - "*.txt"            # Skip text files

Use absolute mount paths

The mount_location must be an absolute path (starting with /):
# Correct
mount_location: "/models/llama"
mount_location: "/app/model_cache/my-model"

# Wrong - will fail validation
mount_location: "models/llama"
mount_location: "./my-model"

Keep mount locations unique

Each weight source must have a unique mount_location:
# Correct - different paths
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
  - source: "hf://sentence-transformers/all-MiniLM-L6-v2@main"
    mount_location: "/models/embeddings"

# Wrong - duplicate paths will fail
weights:
  - source: "hf://model-a@main"
    mount_location: "/models/shared"
  - source: "hf://model-b@main"
    mount_location: "/models/shared"

When weights are re-mirrored

Baseten caches weights based on a hash of their configuration and reuses cached weights when possible to avoid redundant downloads.
Deduplication and mutation detection: Baseten deduplicates files based on their etag (a content hash), not just filename, and only re-mirrors files that have been mutated since the last pull. Unchanged files are reused from blob storage, even across deployments.
Changes that trigger re-mirroring:
Field | Re-mirrors? | Why
source | ✅ Yes | Different repository, revision, or path
allow_patterns | ✅ Yes | Different files will be downloaded
ignore_patterns | ✅ Yes | Different files will be downloaded
Changes that do NOT trigger re-mirroring:
Field | Re-mirrors? | Why
auth_secret_name | ❌ No | Credentials don’t affect which files are mirrored
mount_location | ❌ No | Only affects where weights appear in your container
To force a fresh download of weights that haven’t changed, modify the source to point to a specific commit SHA instead of a branch name, or add a trivial change to allow_patterns.
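For example, adding a pattern is enough to change the configuration hash (assuming, per the table above, that the hash covers the literal pattern list):
config.yaml
weights:
  - source: "hf://meta-llama/Llama-3.1-8B@main"
    mount_location: "/models/llama"
    allow_patterns:
      - "*.safetensors"
      - "config.json"
      - "tokenizer.*"  # newly added pattern changes the config hash, forcing a re-mirror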

How it works

What happens when you truss push

Your truss push command returns immediately after the deployment is created in Baseten. The mirroring process runs in the background, but your model will not be deployed to the Workload Plane until mirroring completes. This ensures weights are available before your model pod starts.

What happens on cold start

Baseten runs multiple Workload Planes across regions and clusters. Each Workload Plane has its own in-cluster cache for fast weight delivery. When your model pod starts:
  1. The BDN Agent on the node fetches the manifest for your weights.
  2. Weights are downloaded through the In-Cluster Cache (shared across pods in the cluster).
  3. Weights are stored in the Node Cache (part of the BDN Agent, shared across pods on the same node).
  4. Weights are mounted read-only to your model pod.

Key benefits

  • Non-blocking push → truss push returns immediately; mirroring happens in the background.
  • One-time mirroring → Weights are mirrored to Baseten storage once, not on every cold start.
  • No upstream dependency at runtime → Once mirrored, scale-ups and inference never contact the original source.
  • Multi-tier caching → In-cluster cache prevents redundant downloads; node cache provides instant access for subsequent replicas.
  • Deduplication → Identical weight files are stored once and shared via hardlinks.
  • Parallel downloads → Large models download faster with concurrent chunk fetching.
