Deploying a model to Baseten turns a Hugging Face model into a production-ready API endpoint. You write a config.yaml that specifies the model, the hardware, and the engine, then truss push builds a TensorRT-optimized container and deploys it. No Python code, no Dockerfile, no container management.
This guide walks through deploying Qwen 2.5 3B Instruct, a small but capable LLM, from a config file to a production API. You’ll set up Truss, write a config, deploy to Baseten, call the model’s OpenAI-compatible endpoint, and promote to production.
Set up your environment
Before you begin, sign up or sign in to Baseten.
Install Truss
Truss is Baseten’s open-source framework for packaging models into deployable containers.
uv (recommended)
pip (macOS/Linux)
pip (Windows)
uv is a fast Python package manager. These commands create a virtual environment, activate it, and install Truss:

uv venv && source .venv/bin/activate
uv pip install truss
These commands create a virtual environment, activate it, and install Truss:

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade truss
These commands create a virtual environment, activate it, and install Truss:

python3 -m venv .venv && .venv\Scripts\activate
pip install --upgrade truss
Authenticate with Baseten
Generate an API key from Settings > API keys, then log in:

truss login
Paste your API key when prompted:
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:
You can skip the interactive prompt by setting BASETEN_API_KEY as an environment variable:

export BASETEN_API_KEY="paste-your-api-key-here"
New accounts include free credits. This guide uses an L4 GPU, one of the most cost-effective options available.
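Before pushing, it can save a failed deploy to confirm the key is actually visible to your shell. A minimal sketch, assuming you authenticated via the environment variable rather than truss login:

```python
import os

def baseten_key_present() -> bool:
    """Return True if BASETEN_API_KEY is set and non-empty."""
    return bool(os.environ.get("BASETEN_API_KEY", "").strip())

if baseten_key_present():
    print("BASETEN_API_KEY is set")
else:
    print("BASETEN_API_KEY is missing; run truss login or export it first")
```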
Create a Truss project
Scaffold a new project:
truss init qwen-2.5-3b && cd qwen-2.5-3b
When prompted, name the model Qwen 2.5 3B.
? 📦 Name this model: Qwen 2.5 3B
Truss Qwen 2.5 3B was created in ~/qwen-2.5-3b
This creates a directory with a config.yaml, a model/ directory, and supporting files. For engine-based deployments like this one, you only need config.yaml. The model/ directory is for custom Python code when you need custom preprocessing, postprocessing, or unsupported model architectures.
Write the config
Replace the contents of config.yaml with:
model_name: Qwen-2.5-3B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-3B-Instruct"
    max_seq_len: 8192
    quantization_type: fp8
    tensor_parallel_count: 1
That’s the entire deployment specification.
model_name identifies the model in your Baseten dashboard.
resources selects an L4 GPU (24 GB VRAM), which is plenty for a 3B parameter model.
trt_llm tells Baseten to use Engine-Builder-LLM, which compiles the model with TensorRT-LLM for optimized inference.
checkpoint_repository points to the model weights on Hugging Face. Qwen 2.5 3B Instruct is ungated, so no access token is needed.
quantization_type: fp8 compresses weights to 8-bit floating point, cutting memory usage roughly in half with negligible quality loss.
max_seq_len: 8192 sets the maximum context length for requests.
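To see why fp8 fits comfortably on an L4, here is a back-of-the-envelope weight-memory estimate. The parameter count is approximate, and this counts weights only; the KV cache and activations need additional room on top of it:

```python
PARAMS = 3.1e9   # Qwen 2.5 3B has roughly 3.1 billion parameters

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB at a given precision."""
    return params * bytes_per_param / 1e9

fp16_gb = weight_gb(PARAMS, 2)  # 2 bytes per weight at half precision
fp8_gb = weight_gb(PARAMS, 1)   # 1 byte per weight with fp8 quantization

print(f"fp16 weights: ~{fp16_gb:.1f} GB, fp8 weights: ~{fp8_gb:.1f} GB")
```

Either precision fits in the L4's 24 GB of VRAM, but fp8 leaves substantially more headroom for the KV cache at long context lengths.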
Deploy
Push the model to Baseten. We'll start by deploying in development mode, the default for truss push, so we can iterate quickly:

truss push
You should see:
✨ Model Qwen 2.5 3B was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
👀 Watching for changes to truss...
The logs URL contains your model ID, the string after /models/ (e.g., abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.
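If you're scripting deployments, you can pull the model ID out of that logs URL rather than copying it by hand. A hypothetical helper, matching the URL shape in the example above:

```python
import re

def model_id_from_logs_url(url: str) -> str:
    """Extract the model ID (the segment after /models/) from a Baseten logs URL."""
    match = re.search(r"/models/([^/]+)/", url)
    if not match:
        raise ValueError(f"no /models/<id>/ segment in {url!r}")
    return match.group(1)

print(model_id_from_logs_url(
    "https://app.baseten.co/models/abc1d2ef/logs/xyz123"
))  # abc1d2ef
```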
Baseten now downloads the model weights from Hugging Face, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above.
Call the model
Engine-based deployments serve an OpenAI-compatible API. Once the deployment shows “Active” in the dashboard, call it using the OpenAI SDK or cURL. Replace {model_id} with your model ID from the deployment output.
Install the OpenAI SDK if you don't have it:

pip install openai

Create a chat completion:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/development/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen-2.5-3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/development/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen-2.5-3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
You should see a response like:
Machine learning is a branch of artificial intelligence where systems learn
patterns from data to make predictions or decisions without being explicitly
programmed for each task...
Any code that works with the OpenAI SDK works with your deployment. Just point the base_url at your model’s endpoint.
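Since the only thing that varies between clients is the base URL, a small helper keeps it in one place. This function is hypothetical, for illustration; the URL pattern matches the endpoints used throughout this guide:

```python
def baseten_base_url(model_id: str, environment: str = "development") -> str:
    """Build the OpenAI-compatible base URL for a Baseten deployment."""
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/v1"

print(baseten_base_url("abc1d2ef"))
# https://model-abc1d2ef.api.baseten.co/environments/development/sync/v1
```

Pass the result as base_url when constructing the OpenAI client, and switch environment to "production" once you've promoted the deployment.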
Iterate with live reload
When you change your config.yaml and want to test quickly, use live reload:

truss watch
You should see:
🪵 View logs for your deployment at https://app.baseten.co/models/<model_id>/logs/<deployment_id>
🚰 Attempting to sync truss with remote
No changes observed, skipping patching.
👀 Watching for changes to truss...
When you save changes, Truss automatically syncs them with the deployed model. This saves time by patching without a full rebuild.
If you stopped the watch session, you can re-attach with:

truss watch

Promote to production
When you're happy with the development deployment, promote it to production from your Baseten dashboard, or publish directly with truss push --publish. This creates a production deployment with its own endpoint. The API URL changes from /environments/development/ to /environments/production/:
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen-2.5-3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
Your model ID is the string after /models/ in the logs URL from truss push. You can also find it in your Baseten dashboard.
Next steps