Cached weights
Speed up cold starts and improve availability by prefetching and caching your weights.
What is a "cold start"?
A "cold start" is the time between a request arriving while the model is scaled to zero and the model being ready to serve that first request. Keeping cold starts short is critical for deployments that need to respond to traffic while meeting your SLAs and keeping costs low. To optimize cold starts, we will go over the following strategies: downloading weights in a background Rust thread that runs during module import, caching weights in a distributed filesystem, and moving weights into the Docker image.
In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
Enabling Caching + Prefetching for a Model
To enable caching, simply add `model_cache` to your `config.yaml` with a valid `repo_id`. The `model_cache` key has a few important fields:

- `repo_id` (required): the repo name from Hugging Face.
- `revision` (required): the revision of the Hugging Face repo, such as a commit sha or a branch name like `refs/pr/1` or `main`.
- `use_volume`: boolean flag that determines whether the weights are downloaded to the Baseten Distributed Filesystem at runtime (recommended) or bundled into the container image (legacy, not recommended).
- `volume_folder`: string, the folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo to `/app/model_cache/my-llama-model` at runtime.
- `allow_patterns`: only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: conversely, you can also denote file patterns to ignore, streamlining the caching process.
Here is an example of a well-written `model_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.
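A minimal sketch of such a configuration is shown below; the exact `allow_patterns` depend on which files your pipeline actually loads, so treat the patterns here as illustrative.

```yaml
model_cache:
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: main
    use_volume: true
    volume_folder: stable-diffusion-xl
    allow_patterns:
      - "*.json"            # model/scheduler/tokenizer configs
      - "*.fp16.safetensors" # only the fp16 safetensors variants
      - "*.txt"              # tokenizer merges files
```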
Many Hugging Face repos contain model weights in several formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of them. To minimize cold starts, ensure that you only cache the weights you need.
What is weight "pre-fetching"?
With `model_cache`, weights are pre-fetched by downloading them ahead of time in a dedicated Rust thread. This means you can perform all kinds of preparation work (importing libraries, JIT compilation of torch/triton modules) before you need access to the files. In practice, statements like `import tensorrt_llm` typically take 10–15 seconds to execute; by that point, the first 5–10 GB of the weights will already have been downloaded.
To use the `model_cache` config with Truss, we require you to actively interact with the `lazy_data_resolver`. Before using any of the downloaded files, you must call `lazy_data_resolver.block_until_download_complete()`. This call blocks until all files in the `/app/model_cache` directory are downloaded and ready to use, and it must be part of your `__init__` or `load` implementation.
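A minimal sketch of a `model.py` that follows this pattern is shown below; the weight path and the heavy import are illustrative, and the way the resolver is picked up from the constructor keyword arguments is an assumption about your setup.

```python
from pathlib import Path


class Model:
    def __init__(self, **kwargs):
        # The resolver is injected by Truss; the background Rust thread is
        # already downloading the weights at this point.
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]
        self._model = None

    def load(self):
        # Do download-independent preparation first (heavy imports, JIT warmup, ...).
        import torch  # illustrative heavy import

        # Block until every file under /app/model_cache is fully downloaded.
        self._lazy_data_resolver.block_until_download_complete()

        # volume_folder from config.yaml determines this path (hypothetical name).
        weights_dir = Path("/app/model_cache/my-llama-model")
        # ... load your model from weights_dir ...

    def predict(self, model_input):
        ...
```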
Private Hugging Face repositories
For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:
Get Hugging Face API Key
Grab an API key from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
Add it to Baseten Secrets Manager
Paste your API key into your secrets manager in Baseten under the key `hf_access_token`. You can read more about secrets here.
Update Config
In your Truss's `config.yaml`, add the following code:
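A sketch of the addition, assuming the usual Truss convention of declaring the secret with a placeholder value (the real token lives only in Baseten's secrets manager):

```yaml
secrets:
  # Declares that this deployment needs the hf_access_token secret;
  # the actual value is never written into config.yaml.
  hf_access_token: null
```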
Make sure that the key `secrets` only shows up once in your `config.yaml`.
If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.
`model_cache` within Chains
To use `model_cache` for Chains, use the `Assets` specifier. In the example below, we download `llama-3.2-1B`. Since this is a gated Hugging Face model, we mount the access token as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
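A sketch of what this can look like is below. The `Assets(..., secret_keys=["hf_access_token"])` part is the piece described above; the `truss_config` import path, the compute choice, and the chainlet body are assumptions for illustration.

```python
import truss_chains as chains
from truss.base import truss_config  # import path assumed


class LlamaChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(gpu="L4"),  # illustrative choice
        assets=chains.Assets(
            cached=[
                truss_config.ModelRepo(
                    repo_id="meta-llama/Llama-3.2-1B",
                    revision="main",
                    use_volume=True,
                    volume_folder="llama-3-2-1b",
                )
            ],
            # Mounts the Hugging Face token for this gated repo.
            secret_keys=["hf_access_token"],
        ),
    )

    def __init__(self):
        # Heavy imports like these typically run while the weights are still
        # downloading in the background.
        import torch
        from transformers import pipeline

        # Load from the mounted cache folder defined by volume_folder above.
        self._pipe = pipeline(
            "text-generation",
            model="/app/model_cache/llama-3-2-1b",
            torch_dtype=torch.bfloat16,
        )

    def run_remote(self, prompt: str) -> str:
        return self._pipe(prompt)[0]["generated_text"]
```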
The model is quite small; in many cases, the download finishes while `from transformers import pipeline` and `import torch` are still executing.
`model_cache` for custom servers
If you are not using a Python `model.py` but a custom server such as vLLM, TEI, or SGLang, you are required to use the `truss-transfer-cli` command to force population of the `/app/model_cache` location. The command will block until the weights are downloaded.
Here is an example of how to use text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the `model_cache`.
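A sketch of such a `config.yaml` is below; the image tag, port, endpoint paths, and repo are illustrative and should be adapted to your setup.

```yaml
base_image:
  image: ghcr.io/huggingface/text-embeddings-inference:1.6  # tag illustrative
docker_server:
  # truss-transfer-cli blocks until /app/model_cache is fully populated,
  # then the embedding server starts against the local weights.
  start_command: sh -c "truss-transfer-cli && text-embeddings-router --model-id /app/model_cache/jina --port 7997"
  server_port: 7997
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/embeddings
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-en
    revision: main
    use_volume: true
    volume_folder: jina
resources:
  accelerator: L4
  use_gpu: true
```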
Optimizing access time further with b10cache
To further reduce weight loading time, we can enable Baseten's Distributed Filesystem (b10cache) for your account. You can verify that it is enabled by checking the logs of your deployment.
Once b10cache is active, downloads that are already cached in the distributed filesystem of the region your deployment runs in are skipped. b10cache acts like a content delivery network: initial cache misses populate the filesystem, and unused files are garbage-collected four days after their last use. Once active, b10cache pulls from the fastest available source. If another pod is active on the same physical node, artifacts may be hot-cached and shared among your deployments; downloads are fully isolated from other organizations. Modifying downloaded artifacts in place (without copying) is not recommended.
If b10cache is not available for your account, we provision the `model_cache` via an optimized download from HuggingFace.co. The download is parallelized, achieving typical speeds of more than 1 GB/s on a 10 Gbit connection. If you want to enable b10cache, feel free to reach out to our support.