In this guide, you will package and deploy Phi-3-mini-4k-instruct, a 3.8-billion-parameter large language model.

We’ll cover:

  1. Loading model weights from Hugging Face
  2. Running model inference on a GPU
  3. Configuring your infrastructure and Python environment
  4. Iterating on your model server in a live reload development environment
  5. Deploying your finished model serving instance for production use

By the end of this tutorial, you will have built a production-ready API endpoint for an open source LLM on autoscaling infrastructure.

This tutorial is a comprehensive introduction to deploying models from scratch. If you want to quickly deploy an off-the-shelf model, start with our model library and Truss examples.

Setup

Before we dive into the code:

  • Sign up for or sign in to your Baseten account.
  • Generate an API key and store it securely.
  • Install Truss, our open-source model packaging framework.
pip install --upgrade truss

New Baseten accounts come with free credits to experiment with model inference. Completing this tutorial should consume less than a dollar of GPU resources.

What is Truss?

Truss is a framework for writing model serving code in Python and configuring the model’s production environment without touching Docker. It also includes a CLI that powers a robust developer experience, which we’ll introduce shortly.

A Truss contains:

  • A file model.py where the Model class is implemented as a serving interface for an AI model.
  • A file config.yaml that specifies GPU resources, Python environment, metadata, and more.
  • Optional folders for bundling model weights (data/) and custom dependencies (packages/).

Truss is designed to map directly from model development code to production-ready model serving code.
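
As a preview of that mapping, here is a minimal sketch of the serving interface you’ll fill in over the rest of this guide. The class and method names match what Truss expects; the bodies are placeholders:

class Model:
    def __init__(self, **kwargs):
        # Initialize instance attributes here
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once when the model server spins up; load weights, tokenizers, etc.
        ...

    def predict(self, request):
        # Runs on every request; perform inference and return a response
        ...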

Create a Truss

To get started, create a Truss with the following terminal command:

truss init phi-3-mini

When prompted, give your Truss a name like Phi 3 Mini.

Then, navigate to the newly created directory:

cd phi-3-mini

You should see the following file structure:

phi-3-mini/
  data/
  model/
    __init__.py
    model.py
  packages/
  config.yaml

For this tutorial, we will be editing model/model.py and config.yaml.

Load model weights

Phi-3-mini-4k-instruct is an open source LLM available for download on Hugging Face. We’ll access its model weights via the transformers library.

Two functions in the Model object, __init__() and load(), run exactly once when the model server is spun up or patched. Using these functions, we load model weights and anything else the model server needs for inference.

For Phi 3, we need to load the LLM and its tokenizer. After initializing the necessary instance attributes, we load the weights and tokenizer from Hugging Face:

model/model.py
# We'll bundle these packages with our Truss in a future step
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct", # Loads model from Hugging Face
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

Run model inference

The final required function in the Model class, predict(), runs each time the model endpoint is requested. The predict() function handles model inference.

The implementation for predict() determines what features your model endpoint supports. You can implement anything from streaming to support for specific input and output specs:

model/model.py
class Model:
    ...
    def predict(self, request):
        messages = request.pop("messages")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=input_ids, max_length=256)
            output_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)
            return {"output": output_text}
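
If you want a quick sanity check before deploying and happen to have a CUDA-capable GPU available, you can exercise this interface directly in a Python session where the Model class above is in scope. This is a hypothetical local smoke test, not part of the Truss workflow:

# Hypothetical local smoke test (requires a CUDA GPU and roughly 8GB of free VRAM)
model = Model()
model.load()  # downloads the weights from Hugging Face on first run
result = model.predict({"messages": [{"role": "user", "content": "Hello!"}]})
print(result["output"])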

Set Python environment

Now that the model server is implemented, we need to give it an environment to run in. In model/model.py, we imported torch and a couple of classes from transformers:

model/model.py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)

To add transformers, torch, and other required packages to our Python environment, we move to config.yaml, the other essential file in every Truss. Here, you can set your Python requirements:

config.yaml
requirements:
  - accelerate==0.30.1
  - einops==0.8.0
  - transformers==4.41.2
  - torch==2.3.0

We strongly recommend pinning versions for every Python requirement. The AI/ML ecosystem moves fast, and breaking changes to unpinned dependencies can cause errors in production.

Select a GPU

Picking the right GPU is a balance between performance and cost. First, consider the size of the model weights. A good rule of thumb is that for float16 LLM inference, you need 2GB of VRAM on your GPU for every billion parameters in the model, plus overhead for processing requests.

Phi 3 Mini has 3.8 billion parameters, meaning that it needs 7.6GB of VRAM just to load model weights. An NVIDIA T4 GPU, the smallest and least expensive GPU available on Baseten, has 16GB of VRAM, which will be more than enough to run the model.
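
If you want to apply the rule of thumb yourself, the arithmetic is just parameters times 2 bytes. Here’s a quick illustrative helper (a back-of-the-envelope sketch, not an official sizing tool):

def estimate_fp16_vram_gb(params_in_billions: float) -> float:
    # float16 uses ~2 bytes per parameter; real usage adds overhead for activations and the KV cache
    return params_in_billions * 2

print(estimate_fp16_vram_gb(3.8))  # 7.6 GB of weights, well within a T4's 16GB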

To use a T4 in your Truss, update the resources in config.yaml:

config.yaml
resources:
  accelerator: T4
  use_gpu: true

Here’s a list of supported GPUs.

Create a development deployment

With the implementation finished, it’s time to test the packaged model. On Baseten, you can spin up a development deployment, which replicates a production environment but adds a live reload system that lets you patch your running model and test changes in seconds.

Get your API key

Retrieve your Baseten API key or, if necessary, create one from your workspace.

To use your API key for model inference, we recommend storing it as an environment variable:

export BASETEN_API_KEY=<baseten_api_key>

Add this line to your ~/.zshrc or similar shell config file.

The first time you run truss push, you’ll be asked to paste in an API key.

Run truss push

To create a development deployment for your model, run the following command in your phi-3-mini working directory:

truss push

You can monitor your model deployment from your model dashboard on Baseten.

Call the development deployment

Your model deployment will go through three stages:

  1. Building the model serving environment (creating a Docker container for model serving)
  2. Deploying the model to the model serving environment (provisioning GPU resources and installing the image)
  3. Loading the model onto the model server (running the load() function)

After deployment is complete, the model will show as “active” in your workspace. You can call the model with:

import requests
import os

model_id = "" # Paste your model ID from your Baseten dashboard
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"messages": [{"role": "user", "content": "What even is AGI?"}]}
)

print(resp.json())

Live reload development environment

Even with Baseten’s optimized infrastructure, deploying a model from scratch takes time. If you had to wait for the image to build, the GPU to be provisioned, and the model to load every time you made a change while testing your code, your dev loop would be slow and frustrating.

Instead, the development environment has live reload. This way, when you make changes to your model, you skip the first two steps of deployment and only need to wait for load() to run, cutting your dev loop from minutes to seconds.

To activate live reload, in your working directory, run:

truss watch

Now, when you make changes to your model/model.py or certain parts of your config.yaml (such as Python requirements), your changes will be patched onto your running model server.

Implementation: generation configs

Let’s add a few more features to our Model object to experience the live reload workflow.

Currently, we only support passing the messages to the model. But LLMs have a number of other parameters like max_length and temperature that matter during inference.

To set these appropriately, we’ll use the preprocess() function in the Model object. Truss models have optional preprocess() and postprocess() functions, which run on the CPU on either side of predict(), which runs on the GPU.

Add the following function to your Truss:

model.py
class Model:
    ...
    def preprocess(self, request):
        terminators = [
            self._tokenizer.eos_token_id,
            self._tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]
        generate_args = {
            "max_length": request.get("max_tokens", 512),
            "temperature": request.get("temperature", 1.0),
            "top_p": request.get("top_p", 0.95),
            "top_k": request.get("top_k", 40),
            "repetition_penalty": request.get("repetition_penalty", 1.0),
            "no_repeat_ngram_size": request.get("no_repeat_ngram_size", 0),
            "do_sample": request.get("do_sample", True),
            "use_cache": True,
            "eos_token_id": terminators,
            "pad_token_id": self._tokenizer.pad_token_id,
        }
        request["generate_args"] = generate_args
        return request
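
We won’t need a postprocess() function in this tutorial, but for completeness, a hypothetical postprocess() that reshapes the raw generated text into a structured response would follow the same pattern:

class Model:
    ...
    def postprocess(self, response):
        # Hypothetical example: runs on the CPU after predict() returns
        return {"output": response}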

To use the generation args, we’ll modify our predict() function as follows:

model.py
class Model:
    ...
    def predict(self, request):
        messages = request.pop("messages")
+       generation_args = request.pop("generate_args")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        ) 
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
-           outputs = self._model.generate(input_ids=input_ids, max_length=256)
+           outputs = self._model.generate(input_ids=input_ids, **generation_args)
            return self._tokenizer.decode(outputs[0], skip_special_tokens=True)

Save your model/model.py file and check your truss watch logs to see the patch being applied. Once the model status on your model dashboard shows as “active”, you can call the API endpoint again with new parameters:

import requests
import os

model_id = "" # Paste your model ID from your Baseten dashboard
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "messages": [{"role": "user", "content": "What even is AGI?"}],
      "max_tokens": 512,
      "temperature": 2.0
    }
)

print(resp.json())

Implementation: streaming output

Right now, the model returns its entire output at once. For many use cases, we’d rather stream the output, receiving tokens as they are generated to reduce user-facing latency.

This requires updates to the imports at the top of model/model.py:

model.py
+from threading import Thread
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
+   GenerationConfig,
+   TextIteratorStreamer,
)

Next, we implement streaming in model/model.py with a helper function that runs generation in a background thread and yields tokens as they become available:

model.py
class Model:
    ...
    def stream(self, input_ids: torch.Tensor, generation_args: dict):
        streamer = TextIteratorStreamer(self._tokenizer)
        generation_config = GenerationConfig(**generation_args)
        generation_kwargs = {
            "input_ids": input_ids,
            "generation_config": generation_config,
            "return_dict_in_generate": True,
            "output_scores": True,
            "max_new_tokens": generation_args["max_length"],
            "streamer": streamer,
        }

        with torch.no_grad():
            # Begin generation in a separate thread
            thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
            thread.start()

            # Yield generated text as it becomes available
            def inner():
                for text in streamer:
                    yield text
                thread.join()

        return inner()

Then in predict(), we enable streaming:

model.py
class Model:
    ...
    def predict(self, request):
        messages = request.pop("messages")
        generation_args = request.pop("generate_args")
+       stream = request.pop("stream", True)
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        ) 
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")

+       if stream:
+           return self.stream(input_ids, generation_args)

        with torch.no_grad():
            outputs = self._model.generate(input_ids=input_ids, **generation_args)
            return self._tokenizer.decode(outputs[0], skip_special_tokens=True)

To call the streaming endpoint, update your API call to process the streaming output:

import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "messages": [{"role": "user", "content": "What even is AGI?"}],
      "stream": True,
      "max_tokens": 256
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

Promote to production

Now that we’re happy with how our model is implemented, we can promote our deployment to production. Production deployments don’t have live reload, but are suitable for real traffic as they have access to full autoscaling settings and can’t be interrupted by patches or other deployment activities.

You can promote your deployment to production through the Baseten UI or by running:

truss push --publish

When a development deployment is promoted to production, it gets rebuilt and deployed.

Call the production endpoint

When the deployment is running in production, the API endpoint for calling it changes from /development/predict to /production/predict. All other inference code remains unchanged:

import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "messages": [{"role": "user", "content": "What even is AGI?"}],
      "stream": True,
      "max_tokens": 256
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)

Both your development and production deployments will scale to zero when not in use.

Learn more

You’ve completed this tutorial by packaging, deploying, and invoking an AI model with Truss!

From here, you may be interested in: