Deploying a model to Baseten turns your model code into a production-ready API endpoint. You package your model with Truss, push it to Baseten, and receive a URL you can call from any application.
This guide walks through deploying Phi-3-mini-4k-instruct, a 3.8B parameter LLM, from local code to a production API. You’ll create a Truss project, write model code, configure dependencies and GPU resources, deploy to Baseten, and call your model’s API endpoint.
Set up your environment
Before you begin, sign up or sign in to Baseten.
Install Truss
Truss is Baseten’s model packaging framework. It handles containerization, dependencies, and deployment configuration.
Using a virtual environment is recommended to avoid dependency conflicts with other Python projects.
uv (recommended)
pip (macOS/Linux)
pip (Windows)
uv is a fast Python package manager. These commands create a virtual environment, activate it, and install Truss:
uv venv && source .venv/bin/activate
uv pip install truss
These commands create a virtual environment, activate it, and install Truss:
python -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
These commands create a virtual environment, activate it, and install Truss:
python -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
New accounts include free credits; this guide should use less than $1 in GPU
costs.
Create a Truss
A Truss packages your model into a deployable container with all dependencies and configurations.
Create a new Truss:
truss init phi-3-mini && cd phi-3-mini
When prompted, give your Truss a name like Phi 3 Mini.
This command scaffolds a project with the following structure:
phi-3-mini/
  model/
    __init__.py
    model.py
  config.yaml
  data/
  packages/
The key files are:
model/model.py: Your model code with load() and predict() methods.
config.yaml: Dependencies, resources, and deployment settings.
data/: Optional directory for data files bundled with your model.
packages/: Optional directory for local Python packages.
Truss uses this structure to build and deploy your model automatically. You
define your model in model.py and your infrastructure in config.yaml, no
Dockerfiles or container management required.
Implement model code
In this example, you’ll implement the model code for
Phi-3-mini-4k-instruct.
You’ll use the transformers library to load the model and tokenizer and PyTorch to run inference.
Replace the contents of model/model.py with the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        self._model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct",
            device_map="cuda",
            torch_dtype="auto"
        )
        self._tokenizer = AutoTokenizer.from_pretrained(
            "microsoft/Phi-3-mini-4k-instruct"
        )

    def predict(self, request):
        messages = request.pop("messages")
        model_inputs = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=256)
        return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}
Truss models follow a three-method pattern that separates initialization from inference:
| Method | When it’s called | What to do here |
|---|---|---|
| __init__ | Once when the class is created | Initialize variables, store configuration, set secrets |
| load | Once at startup, before any requests | Load model weights, tokenizers, and other heavy resources |
| predict | On every API request | Process input, run inference, return response |
Why separate load from __init__?
The load method runs during the container’s cold start, before your model
receives traffic. This keeps expensive operations (like downloading
large model weights) out of the request path.
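If it helps to see the pattern without the model-specific details, here is a stripped-down sketch of the same class; the ... placeholders stand in for the Phi-3 loading and generation code shown above:
class Model:
    def __init__(self, **kwargs):
        # Runs once when the class is created: keep this cheap, just store state.
        self._model = None
        self._tokenizer = None

    def load(self):
        # Runs once during cold start, before any traffic: load weights,
        # tokenizers, and other heavy resources here.
        self._model = ...

    def predict(self, request):
        # Runs on every API request: parse input, run inference, return a dict.
        return {"output": ...}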
Understand the request/response flow
The predict method receives request, a dictionary containing the JSON body
from the API call:
# API call with: {"messages": [{"role": "user", "content": "Hello"}]}
def predict(self, request):
    messages = request.pop("messages")  # Extract from request
    # ... run inference ...
    return {"output": result}  # Return dict becomes JSON response
Whatever dictionary you return becomes the API response. You control the input
parameters and output format.
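Because you control the input format, you can also accept optional generation parameters. Here is a minimal sketch, assuming a hypothetical max_length field in the request body (not part of the code above):
def predict(self, request):
    messages = request.pop("messages")
    # Hypothetical optional field: fall back to a default when the caller omits it.
    max_length = request.pop("max_length", 256)

    model_inputs = self._tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = self._tokenizer(model_inputs, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = self._model.generate(input_ids=inputs["input_ids"], max_length=max_length)
    return {"output": self._tokenizer.decode(outputs[0], skip_special_tokens=True)}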
GPU and memory patterns
A few patterns in this code are common across GPU models:
device_map="cuda": Loads model weights directly to GPU.
.to("cuda"): Moves input tensors to GPU for inference.
torch.no_grad(): Disables gradient tracking to save memory (gradients aren’t needed for inference).
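If you want to confirm the GPU is actually visible before loading weights, a quick sanity check like the following (an optional addition, not part of the guide’s code) can run at the top of load():
import torch

def check_gpu():
    # Optional sanity check: confirm CUDA is available and report total VRAM.
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device found; check the accelerator setting in config.yaml")
    props = torch.cuda.get_device_properties(0)
    print(f"Using {props.name} with {props.total_memory / 1024**3:.1f} GB of VRAM")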
Configure dependencies and resources
The config.yaml file defines your model’s environment and compute resources. This configuration determines how your container is built and what hardware it runs on.
Set Python version and dependencies
python_version: py311
requirements:
- six==1.17.0
- accelerate==0.30.1
- einops==0.8.0
- transformers==4.41.2
- torch==2.3.0
Key configuration options:
| Field | Purpose | Example |
|---|---|---|
| python_version | Python version for your container | py39, py310, py311, py312 |
| requirements | Python packages to install (pip format) | torch==2.3.0 |
| system_packages | System-level dependencies (apt packages) | ffmpeg, libsm6 |
For the complete list of configuration options, see the Truss configuration reference.
Always pin exact versions (e.g., torch==2.3.0, not torch>=2.0). This ensures reproducible builds, so your model behaves the same way every time it’s deployed.
Allocate a GPU
The resources section specifies what hardware your model runs on:
resources:
  accelerator: T4
  use_gpu: true
Choosing the right GPU: Match your GPU to your model’s VRAM requirements. For Phi-3-mini (~7.6GB), a T4 (16GB) provides headroom for inference.
| GPU | VRAM | Good for |
|---|---|---|
| T4 | 16GB | Small models, embeddings, fine-tuned models |
| L4 | 24GB | Medium models (7B parameters) |
| A10G | 24GB | Medium models, image generation |
| A100 | 40/80GB | Large models (13B-70B parameters) |
| H100 | 80GB | Very large models, high throughput |
Estimating VRAM: A rough rule is 2GB of VRAM per billion parameters for float16 models. A 7B model needs ~14GB VRAM minimum.
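That rule of thumb is easy to turn into a back-of-the-envelope calculation. This is an illustrative sketch only; real usage also depends on activations, the KV cache, and sequence length:
def estimate_weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # float16/bfloat16 weights use 2 bytes per parameter.
    return params_billion * bytes_per_param

print(estimate_weight_vram_gb(3.8))  # Phi-3-mini: ~7.6 GB of weights, so a 16GB T4 has headroom
print(estimate_weight_vram_gb(7.0))  # 7B model: ~14 GB for weights alone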
Deploy the model
Authenticate with Baseten
First, generate an API key from the Baseten settings. Then log in:
truss login
The expected output is:
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
Paste your API key when prompted. Truss saves your credentials for future deployments.
Push your model to Baseten
From your project directory, run:
truss push
The expected output is:
Deploying truss using T4x4x16 instance type.
✨ Model Phi 3 Mini was successfully pushed ✨
| --------------------------------------------------------------------------------------- |
| Your model is deploying as a development model. Development models allow you to          |
| iterate quickly during the deployment process. |
| |
| When you are ready to publish your deployed model as a new deployment, |
| pass '--publish' to the 'truss push' command. To monitor changes to your model and |
| rapidly iterate, run the 'truss watch' command. |
| --------------------------------------------------------------------------------------- |
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
In this example, the logs URL contains two IDs:
- Model ID: The string after /models/ (e.g., abc1d2ef), which you’ll use to call the model API.
- Deployment ID: The string after /logs/ (e.g., xyz123), which identifies this specific deployment.
You can also find your model ID in your Baseten dashboard by clicking on your model.
Call the model API
After the deployment is complete, you can call the model API:
From your Truss project directory, run:
truss predict --data '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
The expected output is:
Calling predict on development deployment...
{
"output": "AGI stands for Artificial General Intelligence..."
}
The Truss CLI uses your saved credentials and automatically targets the correct deployment.
Set your API key and replace YOUR_MODEL_ID with your model ID (e.g., abc1d2ef):
export BASETEN_API_KEY=YOUR_API_KEY
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/development/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
The expected output is:
{"output": "AGI stands for Artificial General Intelligence..."}
Set your API key as an environment variable, then replace YOUR_MODEL_ID with your model ID:
export BASETEN_API_KEY=YOUR_API_KEY
import requests
import os

model_id = "YOUR_MODEL_ID"  # Replace with your model ID (e.g., "abc1d2ef")
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "messages": [
            {"role": "user", "content": "What is AGI?"}
        ]
    }
)
print(resp.json())
The expected output is:
{'output': 'AGI stands for Artificial General Intelligence...'}
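The snippet above assumes the call succeeds. In application code it is often safer to add a timeout and surface HTTP errors explicitly; both are standard requests options, not Baseten-specific:
import os
import requests

model_id = "YOUR_MODEL_ID"  # Replace with your model ID (e.g., "abc1d2ef")
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"messages": [{"role": "user", "content": "What is AGI?"}]},
    timeout=120,  # generation can take a while, especially on a cold start
)
resp.raise_for_status()  # raise on 4xx/5xx instead of failing later on resp.json()
print(resp.json()["output"])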
Use live reload for development
To avoid long deploy times when testing changes, run truss watch from your project directory:
truss watch
The expected output is:
🪵 View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
When you save changes to model.py, Truss automatically patches the deployed model:
Changes detected, creating patch...
Created patch to update model code file: model/model.py
Model Phi 3 Mini patched successfully.
This saves time by patching only the updated code without rebuilding Docker containers or restarting the model server.
Once you’re happy with the model, deploy it to production:
truss push --publish
This changes the API endpoint from /development/predict to /production/predict:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/production/predict \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is AGI?"}]}'
To call your production endpoint, you need your model ID. The output of truss push --publish includes a logs URL:
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is the string after /models/ (e.g., abc1d2ef). You can also find it in your Baseten dashboard.
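The Python example from earlier works the same way against the production deployment; only the path segment changes:
import os
import requests

model_id = "YOUR_MODEL_ID"  # Replace with your model ID (e.g., "abc1d2ef")
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    # /production/ targets the published deployment instead of /development/.
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"messages": [{"role": "user", "content": "What is AGI?"}]},
)
print(resp.json())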
Next steps
Now that you’ve deployed your first model, continue learning: