Deploy your first model
From model weights to API endpoint
In this guide, you will package and deploy Phi-3-mini-4k-instruct, a 3.8-billion-parameter large language model.
We’ll cover:
- Loading model weights from Hugging Face
- Running model inference on a GPU
- Configuring your infrastructure and Python environment
- Iterating on your model server in a live reload development environment
- Deploying your finished model serving instance for production use
By the end of this tutorial, you will have built a production-ready API endpoint for an open source LLM on autoscaling infrastructure.
This tutorial is a comprehensive introduction to deploying models from scratch. If you want to quickly deploy an off-the-shelf model, start with our model library and Truss examples.
Setup
Before we dive into the code:
- Sign up for or sign in to your Baseten account.
- Generate an API key and store it securely.
- Install Truss, our open-source model packaging framework.
pip install --upgrade truss
New Baseten accounts come with free credits to experiment with model inference. Completing this tutorial should consume less than a dollar of GPU resources.
What is Truss?
Truss is a framework for writing model serving code in Python and configuring the model’s production environment without touching Docker. It also includes a CLI to power a robust developer experience that will be introduced shortly.
A Truss contains:
- A file model.py where the Model class is implemented as a serving interface for an AI model.
- A file config.yaml that specifies GPU resources, Python environment, metadata, and more.
- Optional folders for bundling model weights (data/) and custom dependencies (packages/).
Truss is designed to map directly from model development code to production-ready model serving code.
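In code, that mapping is a small Python class. Here's a minimal sketch of the interface a Truss implements (illustrative only; the full Phi 3 version is built step by step below):
class Model:
    def __init__(self, **kwargs):
        # Runs once when the model server starts: set up instance state here
        self._model = None

    def load(self):
        # Runs once after __init__: load weights and other heavy resources
        ...

    def predict(self, model_input):
        # Runs on every request: perform inference and return the response
        ...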
Create a Truss
To get started, create a Truss with the following terminal command:
truss init phi-3-mini
When prompted, give your Truss a name like Phi 3 Mini.
Then, navigate to the newly created directory:
cd phi-3-mini
You should see the following file structure:
phi-3-mini/
  data/
  model/
    __init__.py
    model.py
  packages/
  config.yaml
For this tutorial, we will be editing model/model.py and config.yaml.
Load model weights
Phi-3-mini-4k-instruct is an open source LLM available for download on Hugging Face. We’ll access its model weights via the transformers library.
Two functions in the Model object, __init__() and load(), run exactly once when the model server is spun up or patched. Using these functions, we load model weights and anything else the model server needs for inference.
For Phi 3, we need to load the LLM and its tokenizer. After initializing the necessary instance attributes, we load the weights and tokenizer from Hugging Face:
# We'll bundle these packages with our Truss in a future step
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",  # Loads model from Hugging Face
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )
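Because load() runs when the server is spun up (and again after a live reload patch), the weights are downloaded from Hugging Face at startup rather than on each request. That download is part of the cold start time for a new deployment or replica.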
Run model inference
The final required function in the Model class, predict(), runs each time the model endpoint is requested. The predict() function handles model inference.
The implementation of predict() determines what features your model endpoint supports. You can implement anything from streaming to support for specific input and output specs:
class Model:
    ...

    def predict(self, request):
        messages = request.pop("messages")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=input_ids, max_length=256)
        output_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)
        return {"output": output_text}
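If you have a machine with a CUDA GPU and the same packages installed, you can sanity-check this interface before deploying by instantiating the class directly. This is a quick local sketch, not part of the Truss workflow; it assumes you run it from the phi-3-mini directory:
# Local smoke test (assumes a CUDA GPU plus torch and transformers installed)
from model.model import Model

model = Model()
model.load()  # Downloads the weights from Hugging Face on first run
print(model.predict({"messages": [{"role": "user", "content": "Hello!"}]}))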
Set Python environment
Now that the model server is implemented, we need to give it an environment to run in. In model/model.py, we imported a couple of objects from transformers:
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer
)
To add transformers, torch, and other required packages to our Python environment, we move to config.yaml, the other essential file in every Truss. Here, you can set your Python requirements:
requirements:
- accelerate==0.30.1
- einops==0.8.0
- transformers==4.41.2
- torch==2.3.0
We strongly recommend pinning versions for every Python requirement. The AI/ML ecosystem moves fast, and breaking changes to unpinned dependencies can cause errors in production.
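One way to find known-good versions is to install your dependencies in a local virtual environment, confirm that your model code runs, and copy the versions reported by pip freeze into config.yaml.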
Select a GPU
Picking the right GPU is a balance between performance and cost. First, consider the size of the model weights. A good rule of thumb is that for float16 LLM inference, you need 2GB of VRAM on your GPU for every billion parameters in the model, plus overhead for processing requests.
Phi 3 Mini has 3.8 billion parameters, meaning that it needs 7.6GB of VRAM just to load model weights. An NVIDIA T4 GPU, the smallest and least expensive GPU available on Baseten, has 16GB of VRAM, which will be more than enough to run the model.
To use a T4 in your Truss, update the resources in config.yaml:
resources:
  accelerator: T4
  use_gpu: true
Here’s a list of supported GPUs.
Create a development deployment
With the implementation finished, it’s time to test the packaged model. On Baseten, you can spin up a development deployment, which replicates a production environment but adds a live reload system that lets you patch your running model and test changes in seconds.
Get your API key
Retrieve your Baseten API key or, if necessary, create one from your workspace.
To use your API key for model inference, we recommend storing it as an environment variable:
export BASETEN_API_KEY=<baseten_api_key>
Add this line to your ~/.zshrc or similar shell config file.
The first time you run truss push, you’ll be asked to paste in an API key.
Run truss push
To create a development deployment for your model, run the following command in your phi-3-mini working directory:
truss push
You can monitor your model deployment from your model dashboard on Baseten.
Call the development deployment
Your model deployment will go through three stages:
- Building the model serving environment (creating a Docker container for model serving)
- Deploying the model to the model serving environment (provisioning GPU resources and installing the image)
- Loading the model onto the model server (running the load() function)
After deployment is complete, the model will show as “active” in your workspace. You can call the model with:
import requests
import os
model_id = "" # Paste your model ID from your Baseten dashboard
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/development/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={"messages": [{"role": "user", "content": "What even is AGI?"}]}
)
print(resp.json())
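The response body is whatever predict() returns: with the implementation above, a JSON object of the form {"output": "..."}. Note that because we decode the full output sequence, the returned text includes the prompt as well as the generated reply.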
Live reload development environment
Even with Baseten’s optimized infrastructure, deploying a model from scratch takes time. If you had to wait for the image to build, the GPU to be provisioned, and the model to load every time you made a change, testing your code would be a slow and frustrating developer experience.
Instead, the development environment has live reload. This way, when you make changes to your model, you skip the first two steps of deployment and only need to wait for load() to run, cutting your dev loop from minutes to seconds.
To activate live reload, in your working directory, run:
truss watch
Now, when you make changes to your model/model.py or certain parts of your config.yaml (such as Python requirements), your changes will be patched onto your running model server.
Implementation: generation configs
Let’s add a few more features to our model object to experience the live reload workflow.
Currently, we only support passing the messages to the model. But LLMs have a number of other parameters, like max_length and temperature, that matter during inference.
To set these appropriately, we’ll use the preprocess() function in the Model object. Truss models have optional preprocess() and postprocess() functions, which run on the CPU on either side of predict(), which runs on the GPU.
Add the following function to your Truss:
class Model:
    ...

    def preprocess(self, request):
        terminators = [
            self._tokenizer.eos_token_id,
            self._tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]
        generate_args = {
            "max_length": request.get("max_tokens", 512),
            "temperature": request.get("temperature", 1.0),
            "top_p": request.get("top_p", 0.95),
            "top_k": request.get("top_k", 40),
            "repetition_penalty": request.get("repetition_penalty", 1.0),
            "no_repeat_ngram_size": request.get("no_repeat_ngram_size", 0),
            "do_sample": request.get("do_sample", True),
            "use_cache": True,
            "eos_token_id": terminators,
            "pad_token_id": self._tokenizer.pad_token_id,
        }
        request["generate_args"] = generate_args
        return request
To use the generation args, we’ll modify our predict() function as follows:
class Model:
    ...

    def predict(self, request):
        messages = request.pop("messages")
+       generation_args = request.pop("generate_args")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
-           outputs = self._model.generate(input_ids=input_ids, max_length=256)
+           outputs = self._model.generate(input_ids=input_ids, **generation_args)
-       output_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)
-       return {"output": output_text}
+       return self._tokenizer.decode(outputs[0], skip_special_tokens=True)
Save your model/model.py file and check your truss watch logs to see the patch being applied. Once the model status on your model dashboard shows as “active”, you can call the API endpoint again with new parameters:
import requests
import os
model_id = "" # Paste your model ID from your Baseten dashboard
baseten_api_key = os.environ["BASETEN_API_KEY"]
resp = requests.post(
f"https://model-{model_id}.api.baseten.co/development/predict",
headers={"Authorization": f"Api-Key {baseten_api_key}"},
json={
"messages": [{"role": "user", "content": "What even is AGI?"}],
"max_tokens": 512,
"temperature": 2.0
}
)
print(resp.json())
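With temperature raised to 2.0, the reply should be noticeably more random than with the defaults, which is a quick way to confirm that the sampling parameters wired up in preprocess() are actually being applied.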
Implementation: streaming output
Right now, the model works by returning the entire output at once. For many use cases, we’d rather stream model output, receiving the tokens as they are generated to reduce user-facing latency.
This requires updates to the imports at the top of model/model.py:
+from threading import Thread
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
+ GenerationConfig,
+ TextIteratorStreamer,
)
We can implement streaming in model/model.py. We’ll define a function to handle streaming:
class Model:
    ...

    def stream(self, input_ids: list, generation_args: dict):
        streamer = TextIteratorStreamer(self._tokenizer)
        generation_config = GenerationConfig(**generation_args)
        generation_kwargs = {
            "input_ids": input_ids,
            "generation_config": generation_config,
            "return_dict_in_generate": True,
            "output_scores": True,
            "max_new_tokens": generation_args["max_length"],
            "streamer": streamer,
        }

        with torch.no_grad():
            # Begin generation in a separate thread
            thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
            thread.start()

            # Yield generated text as it becomes available
            def inner():
                for text in streamer:
                    yield text
                thread.join()

            return inner()
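The generate() call blocks until generation finishes, so it runs in a background thread while TextIteratorStreamer hands tokens back as they are produced. Because predict() will return the generator created by inner(), the model server can stream the response to the client instead of waiting for the full output.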
Then in predict(), we enable streaming:
class Model:
    ...

    def predict(self, request):
        messages = request.pop("messages")
        generation_args = request.pop("generate_args")
+       stream = request.pop("stream", True)
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt")
        input_ids = inputs["input_ids"].to("cuda")

+       if stream:
+           return self.stream(input_ids, generation_args)

        with torch.no_grad():
            outputs = self._model.generate(input_ids=input_ids, **generation_args)
        return self._tokenizer.decode(outputs[0], skip_special_tokens=True)
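Since stream defaults to True here, existing clients will now receive streamed output; pass "stream": false in the request JSON to get the original single-response behavior back.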
To call the streaming endpoint, update your API call to process the streaming output:
import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [{"role": "user", "content": "What even is AGI?"}],
        "stream": True,
        "max_tokens": 256
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
Promote to production
Now that we’re happy with how our model is implemented, we can promote our deployment to production. Production deployments don’t have live reload, but are suitable for real traffic as they have access to full autoscaling settings and can’t be interrupted by patches or other deployment activities.
You can promote your deployment to production through the Baseten UI or by running:
truss push --publish
When a development deployment is promoted to production, it gets rebuilt and deployed.
Call the production endpoint
When the deployment is running in production, the API endpoint for calling it changes from /development/predict to /production/predict. All other inference code remains unchanged:
import requests
import os

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [{"role": "user", "content": "What even is AGI?"}],
        "stream": True,
        "max_tokens": 256
    },
    stream=True
)

# Print the generated tokens as they get streamed
for content in resp.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
Both your development and production deployments will scale to zero when not in use.
Learn more
You’ve completed the quickstart by packaging, deploying, and invoking an AI model with Truss!
From here, you may be interested in:
- Learning more about model serving with Truss.
- Example implementations for dozens of open source models.
- Inference examples and Baseten integrations.
- Using autoscaling settings to spin up and down multiple GPU replicas.