Cached weights
Speed up cold starts and improve availability by prefetching and caching your weights.
What is a "cold start"?
A "cold start" is the time between a request arriving while the model is scaled to zero and the model being ready to serve that first request. Keeping cold starts short is critical for deployments that need to respond to traffic while meeting your SLAs and keeping costs low. To optimize cold starts, we will go over the following strategies: downloading weights in a background Rust thread that runs during module import, caching weights in a distributed filesystem, and moving weights into the Docker image.
In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
Enabling Caching + Prefetching for a Model
To enable caching, simply add `model_cache` to your `config.yaml` with a valid `repo_id`. The `model_cache` key has a few important fields:

- `repo_id` (required): the repo name from Hugging Face.
- `revision` (required): the revision of the Hugging Face repo, such as a commit sha or a branch name like `refs/pr/1` or `main`.
- `use_volume`: boolean flag that determines whether the weights are downloaded to the Baseten Distributed Filesystem at runtime (recommended) or bundled into the container image (legacy, not recommended).
- `volume_folder`: string, the folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo to `/app/model_cache/my-llama-model` at runtime.
- `allow_patterns`: only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: conversely, you can also denote file patterns to ignore, streamlining the caching process.
Here is an example of a well-written `model_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.
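A minimal sketch of such a configuration is shown below; the exact `allow_patterns` depend on which files your pipeline actually loads, so treat the patterns here as illustrative.

```yaml
model_cache:
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: main
    use_volume: true
    volume_folder: stable-diffusion-xl
    allow_patterns:
      - "*.json"            # model/scheduler/tokenizer configs
      - "*.fp16.safetensors" # only the fp16 safetensors variants
      - "*.txt"              # tokenizer merges files
```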
Many Hugging Face repos contain model weights in several formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of them. To minimize cold starts, ensure that you only cache the weights you need.
What is weight "pre-fetching"?
With `model_cache`, weights are pre-fetched by downloading them ahead of time in a dedicated Rust thread. This means you can perform all kinds of preparation work (importing libraries, JIT compilation of torch/triton modules) before you need access to the files. In practice, statements like `import tensorrt_llm` typically take 10–15 seconds to execute; by that point, the first 5–10 GB of the weights will already have been downloaded.
To use the `model_cache` config with Truss, we require you to actively interact with the `lazy_data_resolver`. Before using any of the downloaded files, you must call `lazy_data_resolver.block_until_download_complete()`. This call blocks until all files in the `/app/model_cache` directory are downloaded and ready to use, and it must be part of your `__init__` or `load` implementation.
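A minimal sketch of a `model.py` that follows this pattern is shown below; the weight path and the heavy import are illustrative, and the way the resolver is picked up from the constructor keyword arguments is an assumption about your setup.

```python
from pathlib import Path


class Model:
    def __init__(self, **kwargs):
        # The resolver is injected by Truss; the background Rust thread is
        # already downloading the weights at this point.
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]
        self._model = None

    def load(self):
        # Do download-independent preparation first (heavy imports, JIT warmup, ...).
        import torch  # illustrative heavy import

        # Block until every file under /app/model_cache is fully downloaded.
        self._lazy_data_resolver.block_until_download_complete()

        # volume_folder from config.yaml determines this path (hypothetical name).
        weights_dir = Path("/app/model_cache/my-llama-model")
        # ... load your model from weights_dir ...

    def predict(self, model_input):
        ...
```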
Private Hugging Face repositories
For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:
Get Hugging Face API Key
Grab an API key from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
Add it to Baseten Secrets Manager
Paste your API key into your secrets manager in Baseten under the key `hf_access_token`. You can read more about secrets here.
Update Config
In your Truss's `config.yaml`, add the following code:
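A sketch of the addition, assuming the usual Truss convention of declaring the secret with a placeholder value (the real token lives only in Baseten's secrets manager):

```yaml
secrets:
  # Declares that this deployment needs the hf_access_token secret;
  # the actual value is never written into config.yaml.
  hf_access_token: null
```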
Make sure that the key `secrets` only shows up once in your `config.yaml`.
If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.
`model_cache` within Chains
To use `model_cache` for Chains, use the `Assets` specifier. In the example below, we download `llama-3.2-1B`. Since this is a gated Hugging Face model, we mount the access token as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
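A sketch of what this can look like is below. The `Assets(..., secret_keys=["hf_access_token"])` part is the piece described above; the `truss_config` import path, the compute choice, and the chainlet body are assumptions for illustration.

```python
import truss_chains as chains
from truss.base import truss_config  # import path assumed


class LlamaChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(gpu="L4"),  # illustrative choice
        assets=chains.Assets(
            cached=[
                truss_config.ModelRepo(
                    repo_id="meta-llama/Llama-3.2-1B",
                    revision="main",
                    use_volume=True,
                    volume_folder="llama-3-2-1b",
                )
            ],
            # Mounts the Hugging Face token for this gated repo.
            secret_keys=["hf_access_token"],
        ),
    )

    def __init__(self):
        # Heavy imports like these typically run while the weights are still
        # downloading in the background.
        import torch
        from transformers import pipeline

        # Load from the mounted cache folder defined by volume_folder above.
        self._pipe = pipeline(
            "text-generation",
            model="/app/model_cache/llama-3-2-1b",
            torch_dtype=torch.bfloat16,
        )

    def run_remote(self, prompt: str) -> str:
        return self._pipe(prompt)[0]["generated_text"]
```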
The model is quite small; in many cases, the download finishes while `from transformers import pipeline` and `import torch` are still executing.
`model_cache` for custom servers
If you are not using a Python `model.py` but a custom server such as vLLM, TEI, or SGLang, you are required to use the `truss-transfer-cli` command to force population of the `/app/model_cache` location. The command will block until the weights are downloaded.
Here is an example of how to use text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the `model_cache`.
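A sketch of such a `config.yaml` is below; the image tag, port, endpoint paths, and repo are illustrative and should be adapted to your setup.

```yaml
base_image:
  image: ghcr.io/huggingface/text-embeddings-inference:1.6  # tag illustrative
docker_server:
  # truss-transfer-cli blocks until /app/model_cache is fully populated,
  # then the embedding server starts against the local weights.
  start_command: sh -c "truss-transfer-cli && text-embeddings-router --model-id /app/model_cache/jina --port 7997"
  server_port: 7997
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/embeddings
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-en
    revision: main
    use_volume: true
    volume_folder: jina
resources:
  accelerator: L4
  use_gpu: true
```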
Optimizing access time further with b10cache
To further reduce weight loading time, we can enable Baseten's Distributed Filesystem (b10cache) for your account. You can verify that it is enabled by checking the logs of your deployment.
Once b10cache is active, downloads that are already cached in the distributed filesystem of the region your deployment runs in are skipped. b10cache acts like a content delivery network: initial cache misses populate the filesystem, and unused files are garbage-collected four days after their last use. Once active, b10cache pulls from the fastest available source. If another pod is active on the same physical node, artifacts may be hot-cached and shared among your deployments; downloads are fully isolated from other organizations. Modifying downloaded artifacts in place (without copying) is not recommended.
If b10cache is not available for your account, we provision the `model_cache` via an optimized download from HuggingFace.co. The download is parallelized, achieving typical speeds of more than 1 GB/s on a 10 Gbit connection. If you want to enable b10cache, feel free to reach out to our support.