Enabling Caching + Prefetching for a Model
To enable caching, simply add `model_cache` to your config.yaml with a valid `repo_id`. The `model_cache` has a few key configurations:
- `repo_id` (required): The repo name from Hugging Face, or the bucket/container from GCS, S3, or Azure.
- `revision` (required for Hugging Face): The revision of the Hugging Face repo, such as a sha or a branch name like `refs/pr/1` or `main`. Not needed for GCS, S3, or Azure.
- `use_volume`: Boolean flag to determine if the weights are downloaded to the Baseten Distributed Filesystem at runtime (recommended) or bundled into the container image (legacy, not recommended).
- `volume_folder`: String; the folder name under which the model weights appear. Setting it to `my-llama-model` will mount the repo to `/app/model_cache/my-llama-model` at runtime.
- `allow_patterns`: Only cache files that match specified patterns. Use Unix shell-style wildcards to denote these patterns.
- `ignore_patterns`: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.
- `runtime_secret_name`: The name of your secret containing the credentials for a private repository or bucket, such as a `hf_access_token` or `gcs_service_account`.
- `kind`: The storage provider type for the model weights:
  - `"hf"` (default): Hugging Face
  - `"gcs"`: Google Cloud Storage
  - `"s3"`: AWS S3
  - `"azure"`: Azure Blob Storage
Here is a `model_cache` example for Stable Diffusion XL. Note how it only pulls the model weights it needs, using `allow_patterns`.
config.yaml
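A sketch of what this could look like; the exact `allow_patterns` depend on which files your pipeline actually loads, and the `volume_folder` name is illustrative:

```yaml
model_cache:
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    revision: main
    use_volume: true
    volume_folder: sdxl
    allow_patterns:
      # Pull only configs and fp16 safetensors weights, skipping
      # duplicate weight formats in the repo.
      - "*.json"
      - "*.fp16.safetensors"
```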
Many Hugging Face repos contain weights in several formats (`.bin`, `.safetensors`, `.h5`, `.msgpack`, etc.), and you only need one of these most of the time. To minimize cold starts, ensure that you only cache the weights you need.
To use the `model_cache` config with Truss, you must actively interact with the `lazy_data_resolver`.
Before using any of the downloaded files, you must call `lazy_data_resolver.block_until_download_complete()`. This call blocks until all files in the `/app/model_cache` directory are downloaded and ready to use, and it must be made in either your `__init__` or your `load` implementation.
model.py
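A minimal sketch, assuming Truss injects the resolver via `kwargs` and the weights were mounted under `volume_folder: my-llama-model`; the transformers classes and generation parameters are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

WEIGHTS_PATH = "/app/model_cache/my-llama-model"


class Model:
    def __init__(self, **kwargs):
        # Truss passes the lazy_data_resolver in via kwargs.
        self._lazy_data_resolver = kwargs["lazy_data_resolver"]
        self._model = None
        self._tokenizer = None

    def load(self):
        # Block until every file under /app/model_cache is downloaded.
        self._lazy_data_resolver.block_until_download_complete()
        # The weights are now safe to read from the mounted folder.
        self._model = AutoModelForCausalLM.from_pretrained(WEIGHTS_PATH)
        self._tokenizer = AutoTokenizer.from_pretrained(WEIGHTS_PATH)

    def predict(self, model_input):
        inputs = self._tokenizer(model_input["prompt"], return_tensors="pt")
        outputs = self._model.generate(**inputs, max_new_tokens=128)
        return {"output": self._tokenizer.decode(outputs[0])}
```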
Private Repositories/Cloud Storage
Private Hugging Face repositories 🤗
For any public Hugging Face repo, you don't need to do anything else. Adding the `model_cache` key with an appropriate `repo_id` should be enough.
However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there are a few steps you need to take:
1. Get a Hugging Face API key: Grab an API key from Hugging Face with read access. Make sure you have access to the model you want to serve.
2. Add it to Baseten Secrets Manager: Paste your API key in your secrets manager in Baseten under the specified key, such as `hf_access_token`. You can read more about secrets here.
3. Update Config: In your Truss's `config.yaml`, add the secret key under `runtime_secret_name`:

config.yaml
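A sketch, assuming you are deploying the gated `meta-llama/Llama-2-7b-chat-hf` repo and named your secret `hf_access_token`:

```yaml
model_cache:
  - repo_id: meta-llama/Llama-2-7b-chat-hf
    revision: main
    use_volume: true
    volume_folder: llama-2-7b
    # Must match the secret name in Baseten's secrets manager.
    runtime_secret_name: hf_access_token
```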
Private GCS Buckets
If you want to deploy a model from a private GCS bucket to Baseten, there are a few steps you need to take:

1. Get a GCS service account key: Create a service account key in your GCS account for the project which contains the model weights.
2. Add it to Baseten Secrets Manager: Paste the contents of the `service_account.json` in your secrets manager in Baseten under the specified key, e.g. `gcs_service_account`. You can read more about secrets here. At a minimum, you should have these credentials:

gcs_service_account
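A sketch of the standard GCP service account key shape, with values redacted; whether every field is strictly required is an assumption:

```json
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "model-weights@my-project.iam.gserviceaccount.com",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```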
3. Update Config: In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:

config.yaml
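A sketch, with a hypothetical bucket and folder name:

```yaml
model_cache:
  - repo_id: my-gcs-bucket
    kind: gcs
    use_volume: true
    volume_folder: my-model
    # Must match the secret name in Baseten's secrets manager.
    runtime_secret_name: gcs_service_account
```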
Private S3 Buckets
If you want to deploy a model from a private S3 bucket to Baseten, there are a few steps you need to take:

1. Get S3 credentials: Get your `access_key_id` and `secret_access_key` in your AWS account for the bucket that contains the model weights.
2. Add them to Baseten Secrets Manager: Paste the following JSON in your secrets manager in Baseten under the specified key, e.g. `aws_secret_json`. You can read more about secrets here.

aws_secret_json
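A sketch of the secret's shape; the two credentials come from the step above, but the exact field names are an assumption:

```json
{
  "access_key_id": "AKIA...",
  "secret_access_key": "..."
}
```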
3. Update Config: In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:

config.yaml
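A sketch, with a hypothetical bucket and folder name:

```yaml
model_cache:
  - repo_id: my-s3-bucket
    kind: s3
    use_volume: true
    volume_folder: my-model
    # Must match the secret name in Baseten's secrets manager.
    runtime_secret_name: aws_secret_json
```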
Private Azure Containers
If you want to deploy a model from a private Azure container to Baseten, there are a few steps you need to take:

1. Get Azure credentials: Get your `account_key` in your Azure account with the container that has the model weights.
2. Add them to Baseten Secrets Manager: Paste the following JSON in your secrets manager in Baseten under the specified key, e.g. `azure_secret_json`. You can read more about secrets here.

azure_secret_json
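A sketch of the secret's shape; only `account_key` is mentioned above, and the `account_name` field is an assumption added to identify the storage account:

```json
{
  "account_name": "mystorageaccount",
  "account_key": "..."
}
```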
3. Update Config: In your Truss's `config.yaml`, make sure to add the `runtime_secret_name` to your `model_cache` matching the above secret name:

config.yaml
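A sketch, with a hypothetical container and folder name:

```yaml
model_cache:
  - repo_id: my-azure-container
    kind: azure
    use_volume: true
    volume_folder: my-model
    # Must match the secret name in Baseten's secrets manager.
    runtime_secret_name: azure_secret_json
```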
model_cache within Chains
To use `model_cache` with Chains, use the `Assets` specifier. In the example below, we will download `llama-3.2-1B`.
As this is a gated Hugging Face model, we set the mounted token as part of the assets: `chains.Assets(..., secret_keys=["hf_access_token"])`.
The model is quite small; in many cases, we will be able to download the model while `from transformers import pipeline` and `import torch` are still running.
chain_cache.py
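A minimal sketch, assuming the cache entries are expressed via `truss_config.ModelRepo`; the pip requirements, folder name, and generation setup are illustrative:

```python
import truss_chains as chains
from truss.base import truss_config


class LlamaChainlet(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["torch", "transformers"],
        ),
        assets=chains.Assets(
            cached=[
                truss_config.ModelRepo(
                    repo_id="meta-llama/Llama-3.2-1B",
                    use_volume=True,
                    volume_folder="llama-3.2-1b",
                )
            ],
            # Mount the Hugging Face token for the gated repo.
            secret_keys=["hf_access_token"],
        ),
    )

    def __init__(self):
        # The weight download can overlap with these slow imports.
        import torch
        from transformers import pipeline

        self._pipeline = pipeline(
            "text-generation",
            model="/app/model_cache/llama-3.2-1b",
            torch_dtype=torch.bfloat16,
        )

    def run_remote(self, prompt: str) -> str:
        return self._pipeline(prompt)[0]["generated_text"]
```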
model_cache for custom servers
If you are not using a Python `model.py` but a custom server such as vLLM, TEI, or SGLang,
you are required to use the `truss-transfer-cli` command to force population of the `/app/model_cache` location. The command will block until the weights are downloaded.
Here is an example of how to use text-embeddings-inference (TEI) on an L4 GPU to populate a Jina embeddings model from Hugging Face into the `model_cache`.
config.yaml
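A sketch, assuming the upstream TEI image and the `jinaai/jina-embeddings-v2-base-en` repo; the image tag, port, and endpoint paths are illustrative:

```yaml
base_image:
  # TEI image built for the L4's compute capability; pin a tag that
  # matches your GPU (illustrative tag shown).
  image: ghcr.io/huggingface/text-embeddings-inference:89-1.6
docker_server:
  # Block until the weights are downloaded, then launch TEI against
  # the mounted model_cache path.
  start_command: sh -c "truss-transfer-cli && text-embeddings-router --model-id /app/model_cache/jina --port 7997"
  server_port: 7997
  predict_endpoint: /v1/embeddings
  readiness_endpoint: /health
  liveness_endpoint: /health
model_cache:
  - repo_id: jinaai/jina-embeddings-v2-base-en
    use_volume: true
    volume_folder: jina
resources:
  accelerator: L4
  use_gpu: true
```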