Fast Cold Starts with Cached Weights
Deploy a language model with the model weights cached at build time
In this example, we go through a Truss that serves an LLM and caches the weights at build time. Loading model weights is often the most time-consuming part of starting a model. Caching the weights at build time bakes them into the Truss image, so they are available immediately when your model replica starts. This makes cold starts significantly faster.
Implementing the Model class
You don't have to change anything about how the Model class is implemented to take advantage of weight caching.
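For reference, here is a minimal sketch of such a Model class, assuming a Hugging Face transformers model; the checkpoint name and input format are illustrative and should match whatever you list in model_cache:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; use the same repo_id you cache in config.yaml
CHECKPOINT = "NousResearch/Llama-2-7b-chat-hf"

class Model:
    def __init__(self, **kwargs):
        self._tokenizer = None
        self._model = None

    def load(self):
        # These calls resolve against the weights already baked into the
        # image at build time, so nothing is downloaded at startup.
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self._model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

    def predict(self, model_input):
        # Assumes a {"prompt": "..."} input shape; adjust to your schema.
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt")
        outputs = self._model.generate(**inputs, max_new_tokens=128)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
```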
Setting up the config.yaml
The config.yaml file is where you make the changes needed to actually cache the weights at build time.
Configuring the model_cache
To cache model weights, set the model_cache key. The repo_id field specifies a Hugging Face repo to pull down and cache at build time, and the ignore_patterns field specifies files to skip. With this set, the repo won't have to be pulled at runtime. Check out the guide for more info.
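As a concrete sketch, the model_cache section might look like the following; the repo id and ignore pattern are illustrative, not required values:

```yaml
model_cache:
  - repo_id: NousResearch/Llama-2-7b-chat-hf # illustrative; use your model's repo
    ignore_patterns:
      - "*.bin" # e.g., skip .bin weights if safetensors are available
```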
The remaining config options are the same as you would set for the model without weight caching.
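For instance, the rest of the file might include standard options like these; all values below are illustrative:

```yaml
model_name: llama-cached-weights # illustrative name
requirements:
  - torch
  - transformers
resources:
  accelerator: A10G # illustrative; pick the GPU your model needs
  use_gpu: true
```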
Deploy the model
Deploy the model like you would other Trusses, with:
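Assuming the standard Truss CLI workflow, that command is:

```sh
truss push
```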
The build step will take longer than for the normal Llama Truss, since bundling the model weights now happens during the build. The deploy step and scale-ups will be much faster with this approach.
You can then invoke the model with:
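A typical invocation with the Truss CLI looks like this; the prompt payload is illustrative and depends on how your predict function parses its input:

```sh
truss predict -d '{"prompt": "What is a cold start?"}'
```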