Baseten deploys models from a single config.yaml file. You point to a model on Hugging Face, choose a GPU, and Baseten builds a TensorRT-optimized container with an OpenAI-compatible API. No Python code, no Dockerfile, no container management. This tutorial deploys Qwen 2.5 3B Instruct, a small but capable LLM, to a production-ready endpoint on an L4 GPU.

Set up your environment

To use Truss, install a recent Truss version and ensure pydantic is v2:
pip install --upgrade truss 'pydantic>=2.0.0'
Truss requires Python >=3.9,<3.15. To set up a fresh development environment, you can use the following commands, which create an environment named truss_env with pyenv:
curl https://pyenv.run | bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
pyenv install 3.11.0
ENV_NAME="truss_env"
pyenv virtualenv 3.11.0 $ENV_NAME
pyenv activate $ENV_NAME
pip install --upgrade truss 'pydantic>=2.0.0'
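To confirm the environment is set up correctly, you can check the installed versions with a short Python snippet (standard library only; pydantic should report a 2.x version):
import importlib.metadata as metadata

# Both should print a version without raising; pydantic must be 2.x.
print("truss:", metadata.version("truss"))
print("pydantic:", metadata.version("pydantic"))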
To deploy a Truss remotely, you also need a Baseten account. It's handy to export your API key for the current shell session, or permanently in your ~/.bashrc:
~/.bashrc
export BASETEN_API_KEY="nPh8..."

Log in to Baseten

Generate an API key from Settings > API keys, then authenticate the Truss CLI:
truss login
Paste your API key when prompted.
💻 Let's add a Baseten remote!
🤫 Quietly paste your API_KEY:

Create a Truss project

Scaffold a new project with truss init:
truss init qwen-2.5-3b
When prompted, name the model Qwen 2.5 3B.
? 📦 Name this model: Qwen 2.5 3B
Truss Qwen 2.5 3B was created in ~/qwen-2.5-3b
This creates a directory with a config.yaml, a model/ directory, and supporting files. For engine-based deployments like this one, you only need config.yaml. The model/ directory is for custom Python code, which this deployment doesn’t require.
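The exact scaffold depends on your Truss version, but it looks roughly like this:
qwen-2.5-3b/
  config.yaml    # deployment configuration (the only file edited in this tutorial)
  model/
    model.py     # custom Python model code, unused for engine-based deployments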

Write the config

Replace the contents of config.yaml with:
config.yaml
model_name: Qwen-2.5-3B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-3B-Instruct"
    max_seq_len: 8192
    quantization_type: fp8
    tensor_parallel_count: 1
That’s the entire deployment specification. The resources section selects an L4 GPU, which has 24 GB of VRAM. The trt_llm section tells Baseten to use its TensorRT-LLM Engine Builder, which compiles the model with TensorRT-LLM for optimized inference. The checkpoint_repository points to the model weights on Hugging Face (Qwen 2.5 3B Instruct is ungated, so no access token is needed). Setting quantization_type: fp8 compresses weights to 8-bit floating point, cutting memory usage roughly in half compared to fp16 with negligible quality loss.
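As a rough sanity check on the memory math (weights only, ignoring the KV cache and activations):
~3B parameters × 2 bytes (fp16) ≈ 6 GB
~3B parameters × 1 byte  (fp8)  ≈ 3 GB
Either fits on the L4's 24 GB, but fp8 leaves far more headroom for the 8192-token KV cache and for batching concurrent requests.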

Deploy

From the project directory, push to Baseten:
cd qwen-2.5-3b && truss push
You should see:
✨ Model Qwen 2.5 3B was successfully pushed ✨

🪵  View logs for your deployment at https://app.baseten.co/models/abc1d2ef/logs/xyz123
The logs URL contains your model ID: the string after /models/ (e.g. abc1d2ef). You’ll need this to call the model’s API. You can also find it in your Baseten dashboard.
Baseten now downloads the model weights, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. You can watch progress in the logs linked above. When the deployment status shows “Active” in the dashboard, it’s ready for requests.
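If you'd rather wait from a script instead of watching the dashboard, one option is to poll the endpoint with a tiny request until it answers. This is just a sketch: it assumes the development deployment created by truss push and the OpenAI-compatible endpoint shown in the next section, with {model_id} replaced by your model ID:
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/development/sync/v1",
)

# Retry a one-token request until the deployment is active and serving.
while True:
    try:
        client.chat.completions.create(
            model="Qwen-2.5-3B",
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        print("Deployment is live.")
        break
    except Exception:
        print("Not ready yet; retrying in 30 seconds...")
        time.sleep(30)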
New accounts include free credits. This deployment uses an L4 GPU, one of the most cost-effective options available.

Call your model

Engine-based deployments serve an OpenAI-compatible API, so any code that works with the OpenAI SDK works with your model. Replace {model_id} with your model ID from the deployment output.
Install the OpenAI SDK if you don’t have it:
pip install openai
Create a chat completion:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/development/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen-2.5-3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
You should see a response like:
Machine learning is a branch of artificial intelligence where systems learn
patterns from data to make predictions or decisions without being explicitly
programmed for each task...
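If you want tokens as they are generated instead of waiting for the full response, the same endpoint can be called with stream=True, following the OpenAI streaming protocol. A minimal variant of the example above, reusing the same client:
stream = client.chat.completions.create(
    model="Qwen-2.5-3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    stream=True,
)

# Print each token as it arrives instead of waiting for the full completion.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()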

What just happened

With a 13-line config file, you deployed a production-ready LLM endpoint. Here’s what Baseten did:
  1. Downloaded the Qwen 2.5 3B Instruct weights from Hugging Face.
  2. Compiled the model with TensorRT-LLM, applying FP8 quantization for faster inference and lower memory usage.
  3. Packaged everything into a container and deployed it to an L4 GPU.
  4. Exposed an OpenAI-compatible API that handles tokenization, batching, and KV cache management automatically.
No model.py, no Docker setup, no inference server configuration. This config-only pattern works for most popular open-source LLMs, including Llama, Qwen, Mistral, Gemma, and Phi models.

Next steps