Accelerate cold starts and availability by prefetching and caching your weights.
To enable caching, add `model_cache` to your `config.yaml` with a valid `repo_id`. The `model_cache` key has a few key configurations:
- `repo_id` (required): The repo name from Hugging Face.
- `revision` (required): The revision of the Hugging Face repo, such as a commit sha or a branch name like `refs/pr/1` or `main`.
- `use_volume`: Boolean flag that determines whether the weights are downloaded to the Baseten Distributed Filesystem at runtime (recommended) or bundled into the container image (legacy, not recommended).
- `volume_folder`: String, the folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo at `/app/model_cache/my-llama-model` at runtime.
- `allow_patterns`: Only cache files that match the specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: Conversely, you can also specify file patterns to ignore, streamlining the caching process.

Below is an example of a `model_cache` for Stable Diffusion XL. Note how it only pulls the model weights it needs using `allow_patterns`.
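A minimal sketch, assuming only the fp16 safetensors weights plus the config and tokenizer files are needed (adjust the patterns to the files your pipeline actually loads):

```yaml
model_cache:
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: main
    use_volume: true
    volume_folder: sdxl-base
    allow_patterns:
      - "*.json"            # model and scheduler configs
      - "*.fp16.safetensors" # only the fp16 copy of the weights
      - "*.txt"             # tokenizer merges/vocab files
```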
Many Hugging Face repos contain model weights in multiple formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.). You usually only need one of these formats. To minimize cold starts, make sure you only cache the weights you need.
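For example, if a repo ships the same weights in both safetensors and legacy formats, a sketch like the following (the repo is chosen purely for illustration) skips the redundant copies with `ignore_patterns`:

```yaml
model_cache:
  - repo_id: openai/whisper-large-v3  # illustrative repo
    revision: main
    use_volume: true
    volume_folder: whisper
    ignore_patterns:
      - "*.bin"      # keep only the safetensors copy of the weights
      - "*.h5"
      - "*.msgpack"
```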
With `model_cache`, weights are pre-fetched by downloading them ahead of time in a dedicated Rust thread. This means you can perform all kinds of preparation work (importing libraries, JIT-compiling torch/triton modules) until you actually need access to the files. In practice, executing a statement like `import tensorrt_llm` typically takes 10–15 seconds; by that point, the first 5–10 GB of the weights will already have been downloaded.

When using the `model_cache` config with Truss, we require you to actively interact with the `lazy_data_resolver`.
Before using any of the downloaded files, you must call `lazy_data_resolver.block_until_download_complete()`. This call blocks until all files in the `/app/model_cache` directory are downloaded and ready to use, and it must be part of your `__init__` or `load` implementation, as in the sketch below.
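A minimal sketch of a `model.py`, assuming the resolver is passed to the constructor as the `lazy_data_resolver` keyword argument and that `volume_folder` is set to `my-llama-model` (an illustrative name):

```python
from pathlib import Path

# Matches volume_folder: my-llama-model in config.yaml (illustrative name).
MODEL_DIR = Path("/app/model_cache/my-llama-model")


class Model:
    def __init__(self, *args, **kwargs):
        # Assumption: Truss hands the resolver to the constructor as a kwarg.
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]
        self._weight_files = []

    def load(self):
        # Preparation work that does not touch the weights can run first,
        # while the download continues in the background.
        import torch  # noqa: F401

        # Block until every file under /app/model_cache is downloaded.
        self._lazy_data_resolver.block_until_download_complete()

        # Only now is it safe to read the cached files.
        self._weight_files = sorted(MODEL_DIR.glob("*.safetensors"))

    def predict(self, model_input):
        return {"num_weight_files": len(self._weight_files)}
```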
For publicly available models, the `model_cache` key with an appropriate `repo_id` should be enough. However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:
1. **Get a Hugging Face API key**: Create an API key with `read` access. Make sure you have access to the model you want to serve.
2. **Add it to the Baseten secrets manager**: Store the key in your Baseten secrets manager under the name `hf_access_token`. You can read more about secrets here.
3. **Update your config**: In your `config.yaml`, add the following code. Make sure that `secrets` only shows up once in your `config.yaml`.
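A sketch of the relevant parts of `config.yaml`; the `repo_id` is illustrative, and the real token value is set in Baseten, not in the file:

```yaml
model_cache:
  - repo_id: meta-llama/Llama-2-7b-chat-hf  # gated repo, illustrative
    revision: main
    use_volume: true
    volume_folder: llama-2-7b
secrets:
  hf_access_token: null  # placeholder; the real value lives in Baseten's secrets manager
```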
### model_cache within Chains

To use `model_cache` with Chains, use the `Assets` specifier. In the example below, we download `llama-3.2-1B`. As this is a gated Hugging Face model, we mount the access token as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
The model is quite small; in many cases, the download will finish while `from transformers import pipeline` and `import torch` are still running.
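A sketch of a Chainlet that caches the gated repo, assuming `truss_chains` and `truss_config.ModelRepo` are importable under these names in your truss version; the pip requirements, GPU type, and pipeline call are illustrative:

```python
import truss_chains as chains
from truss import truss_config  # import path may differ between truss versions

# Cache entry for the gated repo; the volume_folder name is illustrative.
LLAMA_CACHE = truss_config.ModelRepo(
    repo_id="meta-llama/Llama-3.2-1B",
    revision="main",
    use_volume=True,
    volume_folder="llama-3-2-1b",
)


class LlamaChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["torch", "transformers"],
        ),
        compute=chains.Compute(gpu="L4"),
        assets=chains.Assets(
            cached=[LLAMA_CACHE],
            # Mounts the Hugging Face token so the gated download succeeds.
            secret_keys=["hf_access_token"],
        ),
    )

    def __init__(self):
        # These imports run while the weights download in the background.
        import torch
        from transformers import pipeline

        # Load the weights from the mounted model_cache volume.
        self._pipeline = pipeline(
            "text-generation",
            model="/app/model_cache/llama-3-2-1b",
            torch_dtype=torch.bfloat16,
        )

    def run_remote(self, prompt: str) -> str:
        return self._pipeline(prompt)[0]["generated_text"]
```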
### model_cache for custom servers

When you are not using a `model.py` but a custom server such as vLLM, TEI, or SGLang, you are required to use the `truss-transfer-cli` command to force population of the `/app/model_cache` location. The command blocks until the weights are downloaded.

Below is an example of how to use text-embeddings-inference on an L4 to populate a Jina embeddings model from Hugging Face into the `model_cache`.
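A sketch of such a `config.yaml`; the image tag, port, and router flags are illustrative and should be checked against the TEI documentation:

```yaml
base_image:
  # 89-* images target Ada GPUs such as the L4 (illustrative tag).
  image: ghcr.io/huggingface/text-embeddings-inference:89-1.6
docker_server:
  # truss-transfer-cli blocks until /app/model_cache is populated,
  # then the embeddings router is started against the local path.
  start_command: sh -c "truss-transfer-cli && text-embeddings-router --model-id /app/model_cache/jina --port 7997"
  server_port: 7997
  predict_endpoint: /v1/embeddings
  readiness_endpoint: /health
  liveness_endpoint: /health
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-en
    revision: main
    use_volume: true
    volume_folder: jina
resources:
  accelerator: L4
  use_gpu: true
```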