What is a "cold start"?

"Cold start" refers to the time between a request arriving while the model is scaled to zero and the model being ready to serve that first request. Keeping cold starts short is critical for staying responsive to traffic while meeting your SLAs and lowering your costs. To optimize cold starts, we will go over the following strategies: downloading weights in a background Rust thread that runs during module import, caching weights in a distributed filesystem, and moving weights into the docker image.

In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.

Enabling Caching + Prefetching for a Model

To enable caching, add model_cache to your config.yaml with a valid repo_id. The model_cache key has a few key configurations:

  • repo_id (required): The repo name from Hugging Face.
  • revision (required): The revision of the Hugging Face repo: a commit sha or a branch name such as refs/pr/1 or main.
  • use_volume: Boolean flag that determines whether the weights are downloaded to the Baseten Filesystem at runtime (recommended) or bundled into the container image (not recommended).
  • volume_folder: string, folder name under which the model weights appear. Setting it to my-llama-model will mount the repo to /app/model_cache/my-llama-model at runtime.
  • allow_patterns: Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.
  • ignore_patterns: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.

Here is an example of a well-written model_cache for Stable Diffusion XL. Note how it only pulls the model weights that it needs using allow_patterns.

config.yaml
model_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    revision: 207b116dae70ace3637169f1ddd2434b91b3a8cd
    use_volume: true
    volume_folder: sdxl-vae-fp16
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: 462165984030d82259a11f4367a4eed129e94a7b
    use_volume: true
    volume_folder: stable-diffusion-xl-base
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_base_1.0.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-refiner-1.0
    revision: 5d4cfe854c9a9a87939ff3653551c2b3c99a4356
    use_volume: true
    volume_folder: stable-diffusion-xl-refiner
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_refiner_1.0.safetensors

Many Hugging Face repos have model weights in different formats (.bin, .safetensors, .h5, .msgpack, etc.). You only need one of these most of the time. To minimize cold starts, ensure that you only cache the weights you need.

What is weight "pre-fetching"?

With model_cache, weights are pre-fetched by downloading them ahead of time in a dedicated Rust thread. This means you can perform all kinds of preparation work (importing libraries, JIT compilation of torch/triton modules) before you need access to the files. In practice, executing a statement like import tensorrt_llm typically takes 10–15 seconds; by that point, the first 5–10GB of the weights will already have been downloaded.

To use the model_cache config with Truss, we require you to actively interact with the lazy_data_resolver. Before using any of the downloaded files, you must call lazy_data_resolver.block_until_download_complete(), which blocks until all files in the /app/model_cache directory are downloaded and ready to use. This call must be part of either your __init__ or load implementation.

model.py
# <- download is invoked before here.
import torch # this line usually takes 2-5 seconds.
import tensorrt_llm # this line usually takes 10-15 seconds
import onnxruntime # this line usually takes 5-10 seconds

class Model:
    """example usage of `model_cache` in truss"""
    def __init__(self, *args, **kwargs):
        # `lazy_data_resolver` is passed as keyword-argument in init
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]

    def load(self):
        # work that does not require the download may be done beforehand
        random_vector = torch.randn(1000)
        # important to collect the download before using any incomplete data
        self._lazy_data_resolver.block_until_download_complete()
        # after the call, you may use the /app/model_cache directory
        torch.load(
            "/app/model_cache/your_model.pt"
        ) * random_vector

Private Hugging Face repositories 🤗

For any public Hugging Face repo, you don't need to do anything else. Adding the model_cache key with an appropriate repo_id should be enough.

However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:

1. Get Hugging Face API Key

Grab an API key from Hugging Face with read access. Make sure you have access to the model you want to serve.

2. Add it to Baseten Secrets Manager

Paste your API key into your secrets manager in Baseten under the key hf_access_token. You can read more about secrets here.

3. Update Config

In your Truss's config.yaml, add the following code:

config.yaml
secrets:
  hf_access_token: null

Make sure that the key secrets only shows up once in your config.yaml.

If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.

model_cache within Chains

To use model_cache for chains, use the Assets specifier. In the example below, we will download llama-3.2-1B. As this is a gated Hugging Face model, we mount the access token as part of the assets via chains.Assets(..., secret_keys=["hf_access_token"]). The model is quite small; in many cases, it will finish downloading while from transformers import pipeline and import torch are still running.

chain_cache.py
import random
import truss_chains as chains

try:
    # imports on global level for PoemGeneratorLM, to save time during the download.
    from transformers import pipeline
    import torch
except ImportError:
    # RandInt does not have these dependencies.
    pass

class RandInt(chains.ChainletBase):
    async def run_remote(self, max_value: int) -> int:
        return random.randint(1, max_value)

@chains.mark_entrypoint
class PoemGeneratorLM(chains.ChainletBase):
    from truss import truss_config
    LLAMA_CACHE = truss_config.ModelRepo(
        repo_id="meta-llama/Llama-3.2-1B-Instruct",
        revision="c4219cc9e642e492fd0219283fa3c674804bb8ed",
        use_volume=True,
        volume_folder="llama_mini",
        ignore_patterns=["*.pth", "*.onnx"]
    )
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            # The llama model needs some extra Python packages.
            pip_requirements=[
                "transformers==4.48.0",
                "torch==2.6.0",
            ]
        ),
        compute=chains.Compute(
            gpu="L4"
        ),
        # Cache the llama weights and mount the Hugging Face token for the gated repo.
        assets=chains.Assets(cached=[LLAMA_CACHE], secret_keys=["hf_access_token"]),
    )
    # <- Download happens before __init__ is called.
    def __init__(self, rand_int=chains.depends(RandInt, retries=3)) -> None:
        self._rand_int = rand_int
        print("loading cached llama_mini model")
        self.pipeline = pipeline(
            "text-generation",
            model=f"/app/model_cache/llama_mini",
        )

    async def run_remote(self, max_value: int = 3) -> str:
        num_repetitions = await self._rand_int.run_remote(max_value)
        print("writing poem with num_repetitions", num_repetitions)
        poem = str(self.pipeline(
            text_inputs="Write a beautiful and descriptive poem about the ocean. Focus on its vastness, movement, and colors.",
            max_new_tokens=150,
            do_sample=True,
            return_full_text=False,
            temperature=0.7,
            top_p=0.9,
        )[0]['generated_text'])
        return poem * num_repetitions

model_cache for custom servers

If you are not using a Python model.py but rather a custom server such as vLLM, TEI, or SGLang, you are required to use the truss-transfer-cli command to force population of the /app/model_cache location. The command blocks until the weights are downloaded.

Here is an example of how to use text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the model_cache.

config.yaml
base_image:
  image: baseten/text-embeddings-inference-mirror:89-1.6
docker_server:
  liveness_endpoint: /health
  predict_endpoint: /v1/embeddings
  readiness_endpoint: /health
  server_port: 7997
  # using `truss-transfer-cli` to download the weights to `cached_model`
  start_command: bash -c "truss-transfer-cli && text-embeddings-router --port 7997
    --model-id /app/model_cache/my_jina --max-client-batch-size 128 --max-concurrent-requests
    128 --max-batch-tokens 16384 --auto-truncate"
model_cache:
- repo_id: jinaai/jina-embeddings-v2-base-code
  revision: 516f4baf13dec4ddddda8631e019b5737c8bc250
  use_volume: true
  volume_folder: my_jina
  ignore_patterns: ["*.onnx"]
model_metadata:
  example_model_input:
    encoding_format: float
    input: text string
    model: model
model_name: TEI-jinaai-jina-embeddings-v2-base-code-truss-example
resources:
  accelerator: L4

Optimizing access time further with b10cache enabled

b10cache is currently in beta mode

To further reduce weight loading time, we can enable Baseten's Distributed Filesystem (b10cache) for your account. You can verify that it is enabled for your account by viewing the logs of your deployment.

[2025-09-10 01:04:35] [INFO ] b10cache is enabled.
[2025-09-10 01:04:35] [INFO ] Symlink created successfully. Skipping download for /app/model_cache/cached_model/model.safetensors

Once b10cache is active, we skip downloads that are already cached in the filesystem of the region your deployment runs in. b10cache acts like a content delivery network: initial cache misses populate the filesystem, and unused files are garbage collected after 14 days. Once active, b10cache pulls from the fastest available source. If another pod is active on the same physical node, artifacts may be hot-cached and shared among your deployments. Downloads are fully isolated from other organizations.

If b10cache is not available for your account, we provision the model_cache with an optimized download from HuggingFace.co. The download is parallelized, achieving typical download speeds of greater than 1GB/s on a 10Gbit ethernet connection. If you want to enable b10cache, feel free to reach out to our support.

Legacy cache - weights in container

A slower way to make sure your weights are always available is to download them into the docker image at build time. We recommend this only for small models, up to ~1GB in size.

Tradeoffs:

  • highest availability: model weights never depend on S3/Hugging Face uptime (b10cache also offers high availability).
  • slower cold starts: docker images may need to be pulled from a registry that is slower than S3 or Hugging Face.
  • unsuitable for very large models: we don't recommend placing large model artifacts into the docker image; doing so may lead to build failures once the image exceeds 50GB.

Download weights into the image via build_commands

The most flexible way to download weights into the docker image is to use custom build_commands. You can read more about build_commands here.

config.yaml
build_commands:
- 'apt-get update && apt-get install -y git git-lfs'
- 'git lfs install'
- 'git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 /data/local-model'
- echo 'Model downloaded to /data/local-model via git clone'

Download the weights via model_cache and use_volume: false

If you set use_volume: false, we will not use b10cache to mount the model weights at runtime; instead, the weights are vendored into the docker image at build time.

Hugging Face

config.yaml
model_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    revision: 207b116dae70ace3637169f1ddd2434b91b3a8cd
    use_volume: false
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors

Weights will be cached in the default Hugging Face cache directory, ~/.cache/huggingface/hub/models--{your_model_name}/. You can change this directory by setting the HF_HOME or HUGGINGFACE_HUB_CACHE environment variable in your config.yaml.
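
For example, here is a minimal sketch of pointing Hugging Face libraries at a different cache directory via the environment_variables section of config.yaml (the directory shown is just an illustration):

config.yaml
environment_variables:
  HF_HOME: /app/hf_home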

Read more here.

Hugging Face libraries will use this cache directory directly.

model.py
from transformers import AutoModel

AutoModel.from_pretrained("madebyollin/sdxl-vae-fp16-fix")

Google Cloud Storage

Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance.

Your model_cache should look something like this:

config.yaml
model_cache:
  - repo_id: gs://path-to-my-bucket
    use_volume: false

If you are accessing a public GCS bucket, you can ignore the following steps, but make sure you set appropriate permissions on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.

For a private GCS bucket, first export your service account key. Rename it to be service_account.json and add it to the data directory of your Truss.

Your file structure should look something like this:

your-truss
|--model
| └── model.py
|--data
|  └── service_account.json

If you are using version control, like git, for your Truss, make sure to add service_account.json to your .gitignore file. You don't want to accidentally expose your service account key.

Weights will be cached at /app/model_cache/{your_bucket_name}.
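
As a minimal sketch, assuming the bucket from the example above (path-to-my-bucket) contains a model.pt file (a placeholder name), your model code could load the cached weights like this:

model.py
import torch

# Hypothetical file name: substitute a file that actually exists in your bucket.
weights = torch.load("/app/model_cache/path-to-my-bucket/model.pt")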

Amazon Web Services S3

Another popular cloud storage option for hosting model weights is AWS S3, especially if you're already using AWS services.

Your model_cache should look something like this:

config.yaml
model_cache:
  - repo_id: s3://path-to-my-bucket
    use_volume: false

If you are accessing a public S3 bucket, you can ignore the subsequent steps, but make sure you set an appropriate policy on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.

However, for a private S3 bucket, you need to first find your aws_access_key_id, aws_secret_access_key, and aws_region in your AWS dashboard. Create a file named s3_credentials.json. Inside this file, add the credentials that you identified earlier as shown below. Place this file into the data directory of your Truss. The key aws_session_token can be included, but is optional.

Here is an example of how your s3_credentials.json file should look:

{
    "aws_access_key_id": "YOUR-ACCESS-KEY",
    "aws_secret_access_key": "YOUR-SECRET-ACCESS-KEY",
    "aws_region": "YOUR-REGION"
}

Your overall file structure should now look something like this:

your-truss
|--model
| └── model.py
|--data
|  └── s3_credentials.json

When you are generating credentials, make sure that the resulting keys have at minimum the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListObjects"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::S3_BUCKET/PATH_TO_MODEL/*"]
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::S3_BUCKET"]
        }
    ]
}

If you are using version control, like git, for your Truss, make sure to add s3_credentials.json to your .gitignore file. You don't want to accidentally expose your AWS credentials.

Weights will be cached at /app/model_cache/{your_bucket_name}.