Deploy Llama 2 with Caching
Enable fast cold starts for a model with private Hugging Face weights
In this example, we will cover how you can use the `model_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo. Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
We use `Llama-2-7b`, a popular open-source large language model, as an example. To follow along, you need to request access to Llama 2:
- First, sign up for a Hugging Face account if you don’t already have one.
- Request access to Llama 2 from Meta’s website.
- Next, request access to Llama 2 on Hugging Face by clicking the “Request access” button on the model page.
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets.
- Create a Hugging Face API token and copy it to your clipboard.
- Add the token with the key `hf_access_token` to your organization's secrets on Baseten.
Step 0: Initialize Truss
Get started by creating a new Truss:
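For example, assuming we name the working directory `llama-2-7b-chat` (any name works):

```sh
truss init llama-2-7b-chat
```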
Select the `TrussServer` option, then hit `y` to confirm Truss creation. Then navigate to the newly created directory:
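Again assuming the directory name chosen above:

```sh
cd llama-2-7b-chat
```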
Step 1: Implement Llama 2 7B in Truss
Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

- `__init__`, which creates an instance of the object with a `_model` property
- `load`, which runs once when the model server is spun up and loads the `pipeline` model
- `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We will also create a helper function `format_prompt` outside of the `Model` class to appropriately format the incoming text according to the Llama 2 specification.

Read the quickstart guide for more details on `Model` class implementation.
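Here is a minimal sketch of what `model/model.py` could look like. The structure (the three member functions plus `format_prompt`) follows the description above, but the specific pipeline arguments, prompt template, generation parameters, and input/output schema (`{"prompt": ...}` in, `{"output": ...}` out) are illustrative choices rather than the canonical implementation. It also assumes `hf_access_token` is declared under `secrets` in `config.yaml` so Truss passes it in at construction time.

```python
from typing import Dict

import torch
from transformers import pipeline

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    # Wrap the user's message in Llama 2's chat prompt template.
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]"


class Model:
    def __init__(self, **kwargs):
        # Truss passes secrets in at construction time (assuming hf_access_token
        # is declared under `secrets` in config.yaml); we need the Hugging Face
        # token to download the gated Llama 2 weights.
        self._secrets = kwargs["secrets"]
        self._model = None

    def load(self):
        # Runs once when the model server spins up: build a text-generation pipeline.
        self._model = pipeline(
            "text-generation",
            model="meta-llama/Llama-2-7b-chat-hf",
            torch_dtype=torch.float16,
            device_map="auto",
            token=self._secrets["hf_access_token"],
        )

    def predict(self, model_input: Dict) -> Dict:
        # Runs on every request: format the prompt and generate a completion.
        prompt = format_prompt(model_input["prompt"])
        result = self._model(prompt, max_new_tokens=256)
        return {"output": result[0]["generated_text"]}
```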
Step 2: Set Python dependencies
Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other dependencies. We list them below.
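A sketch of the `requirements` section follows. The specific packages (accelerate, safetensors, torch) are a reasonable assumption for this model, and the version pins are illustrative, so check current releases before deploying:

```yaml
requirements:
  - accelerate==0.21.0
  - safetensors==0.3.2
  - torch==2.0.1
  - transformers==4.34.0
```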
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so you want to have an up-to-date version of each package while also being protected from breaking changes.
Step 3: Configure Hugging Face caching
Finally, we can configure Hugging Face caching in `config.yaml` by adding the `model_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.
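Concretely, the `model_cache` entry might look like this, matching the settings described below:

```yaml
model_cache:
  - repo_id: meta-llama/Llama-2-7b-chat-hf
    ignore_patterns:
      - "*.bin"
```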
In this configuration:

- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- We use a wildcard to ignore all `.bin` files in the model directory by providing a pattern under `ignore_patterns`. This is because the model weights are stored in both `.bin` and `.safetensors` formats, and we only want to cache the `.safetensors` files.
Step 4: Deploy the model
You'll need a Baseten API key for this step. Make sure you added your Hugging Face access token with the key `hf_access_token` to your organization's secrets.
We have successfully packaged Llama 2 as a Truss. Let’s deploy!
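Deploying is typically a single command from the Truss directory, with your Baseten API key on hand:

```sh
truss push
```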
Step 5: Invoke the model
You can invoke the model with:
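For example, via the Truss CLI, assuming the `{"prompt": ...}` input schema from the sketch above (the prompt text is arbitrary):

```sh
truss predict -d '{"prompt": "What is the meaning of life?"}'
```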