Cached Weights
Accelerate cold starts and availability by prefetching and caching your weights.
What is a "cold start"?
A "cold start" is the time from when a request is received while the model is scaled to zero until the model is ready to handle that first request. Keeping cold starts short is critical for deployments that need to stay responsive to traffic while maintaining your SLAs and lowering your costs. To optimize cold starts, we will go over the following strategies: prefetching weights in a background Rust thread that runs during module import, caching weights in a distributed filesystem, and moving weights into the docker image.
In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.
Enabling Caching + Prefetching for a Model
To enable caching, simply add `model_cache` to your `config.yaml` with a valid `repo_id`. The `model_cache` has a few key configurations:
- `repo_id` (required): The repo name from Hugging Face.
- `revision` (required): The revision of the Hugging Face repo, such as a commit sha or a branch name like `refs/pr/1` or `main`.
- `use_volume`: Boolean flag that determines whether the weights are downloaded to the Baseten filesystem at runtime (recommended) or bundled into the container image (not recommended).
- `volume_folder`: String, the folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo to `/app/model_cache/my-llama-model` at runtime.
- `allow_patterns`: Only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: Conversely, you can also denote file patterns to ignore, further streamlining the caching process.
Here is an example of a well-written `model_cache` for Stable Diffusion XL. Note how it only pulls the model weights that it needs using `allow_patterns`.
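A sketch of what such a config might look like is below; the `revision` and the exact `allow_patterns` values are assumptions and should be adjusted to the files your pipeline actually loads:

```yaml
model_cache:
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: main                # assumed; pin a specific commit sha for reproducibility
    use_volume: true
    volume_folder: stable-diffusion-xl
    allow_patterns:               # only pull the fp16 safetensors plus config/tokenizer files
      - "*.json"
      - "*.txt"
      - "*.fp16.safetensors"
```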
Many Hugging Face repos have model weights in different formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of these. To minimize cold starts, ensure that you only cache the weights you need.
What is weight "pre-fetching"?
With `model_cache`, weights are pre-fetched by downloading them ahead of time in a dedicated Rust thread.
This means you can perform all kinds of preparation work (importing libraries, JIT compilation of torch/triton modules) until you need access to the files.
In practice, executing a statement like `import tensorrt_llm` typically takes 10-15 seconds. By that point, the first 5-10 GB of the weights will have already been downloaded.
To use the `model_cache` config with Truss, you are required to actively interact with the `lazy_data_resolver`.
Before using any of the downloaded files, you must call `lazy_data_resolver.block_until_download_complete()`. This blocks until all files in the `/app/model_cache` directory are downloaded and ready to use.
The call must be part of either your `__init__` or your `load` implementation.
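A minimal sketch, assuming the resolver is handed to the model via keyword arguments as in standard Truss model templates:

```python
# model/model.py
class Model:
    def __init__(self, **kwargs):
        # Provided by Truss when model_cache is configured in config.yaml.
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]
        self._pipeline = None

    def load(self):
        # Block until every file under /app/model_cache is downloaded and ready.
        self._lazy_data_resolver.block_until_download_complete()
        # After this point it is safe to read the cached weights,
        # e.g. from /app/model_cache/<volume_folder>.
```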
Private Hugging Face repositories
For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:
Get Hugging Face API Key
Grab an API key from Hugging Face with `read` access. Make sure you have access to the model you want to serve.
Add it to Baseten Secrets Manager
Paste your API key in your secrets manager in Baseten under the key `hf_access_token`. You can read more about secrets here.
Update Config
In your Truss's `config.yaml`, add the following code:
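A minimal sketch; the placeholder stays `null` in the file, and the real token is read from the Baseten secrets manager at runtime:

```yaml
secrets:
  hf_access_token: null
```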
Make sure that the key `secrets` only shows up once in your `config.yaml`.
If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.
`model_cache` within Chains
To use `model_cache` for Chains, use the `Assets` specifier. In the example below, we will download `llama-3.2-1B`.
As this model is a gated Hugging Face model, we set the access token used for mounting as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
The model is quite small; in many cases, the download will finish while `from transformers import pipeline` and `import torch` are still running.
`model_cache` for custom servers
If you are not using Python's `model.py` but a custom server such as vLLM, TEI, or SGLang, you are required to use the `truss-transfer-cli` command to force population of the `/app/model_cache` location. The command will block until the weights are downloaded.
Here is an example of how to use text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the `model_cache`.
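A sketch of such a `config.yaml` is below. The image tag, port, endpoint paths, and the `docker_server` field names are assumptions; the key idea is that `truss-transfer-cli` runs first in the start command and blocks until `/app/model_cache` is populated:

```yaml
base_image:
  image: ghcr.io/huggingface/text-embeddings-inference:1.6  # assumed tag
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-en
    revision: main
    use_volume: true
    volume_folder: jina-embeddings
docker_server:
  # Populate /app/model_cache first, then launch the TEI server against the local path.
  start_command: sh -c "truss-transfer-cli && text-embeddings-router --model-id /app/model_cache/jina-embeddings --port 7997"
  server_port: 7997
  predict_endpoint: /v1/embeddings
  readiness_endpoint: /health
  liveness_endpoint: /health
resources:
  accelerator: L4
```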
Optimizing access time further with b10cache enabled
To further reduce weight loading time, we can enable Baseten's Distributed Filesystem (b10cache) for your account. You can validate that it is enabled for your account by viewing the logs of your deployment.
Once b10cache is active, downloads that are already cached in the filesystem of the region your deployment runs in are skipped. b10cache acts like a content delivery network: initial cache misses populate the filesystem, and unused files are garbage collected after 14 days. Once active, b10cache pulls from the fastest available source. If another pod is active on the same physical node, artifacts may be hot-cached and shared among your deployments. Downloads are fully isolated from other organizations.
If b10cache is not available for your account, we provision the `model_cache` with an optimized download from HuggingFace.co. The download is parallelized, achieving typical download speeds of greater than 1 GB/s on a 10 Gbit ethernet connection. If you want to enable b10cache, feel free to reach out to our support.
Legacy cache - weights in container
A slower way to make sure your weights are always available is to download them into the docker image at build time. We recommend this only for small models, up to ~1 GB in size.
Tradeoffs:
- Highest availability: model weights never depend on S3 or Hugging Face uptime. (b10cache also offers high availability.)
- Slower cold starts: docker images may need to be pulled from a source that is slower than S3 or Hugging Face.
- Unsuitable for very large models: we don't recommend placing large model artifacts into the docker image; it can lead to build failures when the artifacts exceed 50 GB.
Download weights into the image via `build_commands`
The most flexible way to download weights into the docker image is to use custom `build_commands`.
You can read more on build_commands here.
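For example, a minimal sketch using the Hugging Face CLI; the repo and target directory are illustrative:

```yaml
build_commands:
  - pip install -U "huggingface_hub[cli]"
  # Bake the weights into the image at build time (illustrative repo and path).
  - huggingface-cli download BAAI/bge-small-en-v1.5 --local-dir /app/weights
```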
Download the weights via `model_cache` and `use_volume: false`
If you set `use_volume: false`, we will not use b10cache to mount the model weights at runtime, but instead vendor them into the docker image.
Hugging Face
Weights will be cached in the default Hugging Face cache directory, `~/.cache/huggingface/hub/models--{your_model_name}/`. You can change this directory by setting the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variable in your `config.yaml`.
Hugging Face libraries will pick this directory up directly.
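A sketch of such a config, assuming a small example repo and the `environment_variables` key for overriding the cache location:

```yaml
model_cache:
  - repo_id: BAAI/bge-small-en-v1.5   # illustrative small model
    revision: main
    use_volume: false                  # weights are vendored into the image at build time
environment_variables:
  HF_HOME: /app/hf_cache              # optional: override the default cache directory
```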
Google Cloud Storage
Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance.
Your `model_cache` should look something like this:
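For instance (the bucket path is hypothetical):

```yaml
model_cache:
  - repo_id: gs://your-bucket-name/path/to/model
```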
If you are accessing a public GCS bucket, you can ignore the following steps, but make sure you set appropriate permissions on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.
For a private GCS bucket, first export your service account key. Rename it to `service_account.json` and add it to the `data` directory of your Truss.
Your file structure should look something like this:
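For example, assuming a standard Truss layout:

```
your-truss/
├── config.yaml
├── data/
│   └── service_account.json
└── model/
    └── model.py
```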
If you are using version control, like git, for your Truss, make sure to add `service_account.json` to your `.gitignore` file. You don't want to accidentally expose your service account key.
Weights will be cached at `/app/model_cache/{your_bucket_name}`.
Amazon Web Services S3
Another popular cloud storage option for hosting model weights is AWS S3, especially if you're already using AWS services.
Your `model_cache` should look something like this:
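For instance (the bucket path is hypothetical):

```yaml
model_cache:
  - repo_id: s3://your-bucket-name/path/to/model
```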
If you are accessing a public S3 bucket, you can ignore the subsequent steps, but make sure you set an appropriate policy on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail.
However, for a private S3 bucket, you first need to find your `aws_access_key_id`, `aws_secret_access_key`, and `aws_region` in your AWS dashboard. Create a file named `s3_credentials.json`. Inside this file, add the credentials that you identified earlier, as shown below. Place this file into the `data` directory of your Truss.
The key `aws_session_token` can be included, but is optional.
Here is an example of how your `s3_credentials.json` file should look:
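A sketch with placeholder values, using the keys listed above:

```json
{
  "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
  "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
  "aws_region": "us-east-1"
}
```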
Your overall file structure should now look something like this:
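For example, assuming a standard Truss layout:

```
your-truss/
├── config.yaml
├── data/
│   └── s3_credentials.json
└── model/
    └── model.py
```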
When you are generating credentials, make sure that the resulting keys have at minimum the following IAM policy:
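A sketch of a minimal read-and-list policy; scope the resources to your actual bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```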
If you are using version control, like git, for your Truss, make sure to add `s3_credentials.json` to your `.gitignore` file. You don't want to accidentally expose your AWS credentials.
Weights will be cached at `/app/model_cache/{your_bucket_name}`.