Summary

To load a gated or private model from Hugging Face:

  1. Create an access token on your Hugging Face account.
  2. Add the hf_access_token key to the secrets in your config.yaml, and add its value to your Baseten account.
  3. Add use_auth_token to the appropriate line in model.py.

Step-by-step example

BERT base (uncased) is a masked language model that can be used to infer missing words in a sentence.

While the model is publicly available on Hugging Face, we copied it into a gated model to use in this tutorial. The process for using a gated model is the same as for a private model.

Keep reading for step-by-step instructions on how to build the finished private model Truss.

This example will cover:

  1. Implementing a transformers.pipeline model in Truss
  2. Securely accessing secrets in your model server
  3. Using a gated or private model with an access token

Step 0: Initialize Truss

Get started by creating a new Truss:

truss init private-bert

Give your model a name when prompted, like Private Model Demo. Then, navigate to the newly created directory:

cd private-bert

Step 1: Implement the Model class

BERT base (uncased) is a pipeline model, so it is straightforward to implement in Truss.

In model/model.py, we write the class Model with three member functions:

  • __init__, which creates an instance of the object with a _model property
  • load, which runs once when the model server is spun up and loads the pipeline model
  • predict, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

Read the quickstart guide for more details on Model class implementation.

model/model.py
from transformers import pipeline


class Model:
    def __init__(self, **kwargs) -> None:
        self._secrets = kwargs["secrets"]
        self._model = None

    def load(self):
        self._model = pipeline(
            "fill-mask",
            model="baseten/docs-example-gated-model"
        )

    def predict(self, model_input):
        return self._model(model_input)
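
To check the call order locally without downloading any weights, the lifecycle above can be sketched with a stand-in for the pipeline. The fake_pipeline function is hypothetical, and the dictionary it returns is invented for illustration; it is not the output format of the real fill-mask pipeline.

```python
# Sketch of the Model lifecycle with a stand-in pipeline (no transformers needed).
# fake_pipeline is hypothetical; it only mimics how a pipeline object is called.


def fake_pipeline(task, model):
    """Return a callable that echoes its input instead of running inference."""

    def run(text):
        return [{"task": task, "model": model, "input": text}]

    return run


class Model:
    def __init__(self, **kwargs) -> None:
        # Secrets arrive as a keyword argument, as in the tutorial's model.py.
        self._secrets = kwargs["secrets"]
        self._model = None

    def load(self):
        # Runs once at startup; the real version calls transformers.pipeline.
        self._model = fake_pipeline(
            "fill-mask",
            model="baseten/docs-example-gated-model",
        )

    def predict(self, model_input):
        # Runs on every request; input and output must be JSON-serializable.
        return self._model(model_input)


model = Model(secrets={})
model.load()
result = model.predict("It is a [MASK] world")
print(result)
```

Calling predict before load would fail because _model is still None, which is why load runs once when the server starts.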

Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in config.yaml.

BERT base (uncased) has two dependencies:

config.yaml
requirements:
- torch==2.0.1
- transformers==4.30.2

Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so pin a recent version of each package: you get current features while staying protected from breaking changes in later releases.
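
As a rough way to enforce this, a short script can flag any requirement that lacks an exact pin. This is a minimal sketch: find_unpinned is a hypothetical helper, not part of Truss, and it uses a simple string check rather than parsing the full version-specifier syntax.

```python
# Sketch: flag requirement lines that are not pinned to an exact version.
# find_unpinned is a hypothetical helper, not part of Truss.


def find_unpinned(requirements):
    """Return the entries that do not contain an exact '==' pin."""
    return [req for req in requirements if "==" not in req]


# The two requirements from this tutorial's config.yaml.
requirements = ["torch==2.0.1", "transformers==4.30.2"]
unpinned = find_unpinned(requirements)
print(unpinned)
```

An entry such as torch>=2.0 would be returned by find_unpinned, since it allows any later release.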

Step 3: Set required secret

Now it's time to mix in access to the gated model:

  1. Go to the model page on Hugging Face and accept the terms to access the model.
  2. Create an access token on your Hugging Face account.
  3. Add the hf_access_token key and value to your Baseten workspace secret manager.
  4. In your config.yaml, add the key hf_access_token:

config.yaml
secrets:
  hf_access_token: null

Never set the actual value of a secret in the config.yaml file. Only put secret values in secure places, like the Baseten workspace secret manager.
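
Because config.yaml only holds a null placeholder, it can help to fail fast with a clear message if the real value was never added to the secret manager. The sketch below assumes only that the secrets object supports key lookup, as model.py in this tutorial does. The get_hf_token helper is hypothetical, and how the model server itself behaves when a secret is missing is not covered here.

```python
# Sketch: fail fast with a clear message if the secret was never set.
# get_hf_token is a hypothetical helper; Truss does not provide it.


def get_hf_token(secrets, key="hf_access_token"):
    """Return the token from a secrets mapping, or raise if it is missing or empty."""
    try:
        token = secrets[key]
    except KeyError:
        token = None
    if not token:
        raise ValueError(
            f"Secret '{key}' is not set; add it to your workspace secret manager."
        )
    return token


# Placeholder value for illustration only, not a real token.
token = get_hf_token({"hf_access_token": "hf_example_placeholder"})
```

Inside the Model class, load could call get_hf_token(self._secrets) and pass the result on, so a missing secret surfaces as a readable error rather than a failed download.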

Step 4: Use access token in load

In model/model.py, you can give your model access to secrets in the __init__ function:

model/model.py
def __init__(self, **kwargs) -> None:
    self._secrets = kwargs["secrets"]
    self._model = None

Then, update the load function with use_auth_token:

model/model.py
self._model = pipeline(
    "fill-mask",
    model="baseten/docs-example-gated-model",
    use_auth_token=self._secrets["hf_access_token"]
)

This will allow the pipeline function to load the specified model from Hugging Face.

Step 5: Deploy the model

You'll need a Baseten API key for this step.

We have successfully packaged a gated model as a Truss. Let's deploy!

Use --trusted with truss push to give the model server access to secrets stored on the remote host.

truss push --trusted

Wait for the model to finish deployment before invoking.

You can invoke the model with:

truss predict -d '"It is a [MASK] world"'
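
The argument to -d is a JSON document, which is why the sentence sits in double quotes inside the shell's single quotes. A small sketch using only the Python standard library shows how that argument can be built from a plain string; the variable names are illustrative.

```python
import json
import shlex

# Sketch: build the argument for `truss predict -d` from a Python value.
prompt = "It is a [MASK] world"
payload = json.dumps(prompt)      # JSON string, including the double quotes
shell_arg = shlex.quote(payload)  # single-quoted so the shell passes it unchanged

print(f"truss predict -d {shell_arg}")
# prints: truss predict -d '"It is a [MASK] world"'
```

The same approach works for any JSON-serializable input, such as a list or dictionary, if your predict function accepts one.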