vLLM supports a wide range of models and performance optimizations. This guide deploys a vLLM model as a custom Docker server on Baseten.

Example: Deploy Qwen 2.5 3B on an L4

This configuration serves Qwen 2.5 3B with vLLM on an L4 GPU. The deployment process is the same for larger models like GLM-4.7. Adjust the resources and start_command to match your model’s requirements.

Set up your environment

Before you deploy a model, complete three setup steps.

1. Create an API key for your Baseten account

Create an API key and save it as an environment variable:
export BASETEN_API_KEY="abcd.123456"

2. Add an access token for Hugging Face

Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
  1. Accept the license for any gated models you wish to access, like Gemma 3.
  2. Create a read-only user access token from your Hugging Face account.
  3. Add the hf_access_token secret to your Baseten workspace.

3. Install Truss in your local development environment

Install Truss and the OpenAI SDK:
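pip install --upgrade truss openai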

Configure the model

Create a directory with a config.yaml file:
mkdir qwen-2-5-3b-vllm
touch qwen-2-5-3b-vllm/config.yaml
Copy the following configuration into config.yaml:
config.yaml
model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: true
    model: Qwen/Qwen2.5-3B-Instruct
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
model_name: Qwen 2.5 3B vLLM
base_image:
  image: vllm/vllm-openai:v0.15.1
docker_server:
  start_command: sh -c "truss-transfer-cli && vllm serve /app/model_cache/qwen --served-model-name Qwen/Qwen2.5-3B-Instruct --host 0.0.0.0 --port 8000 --enable-prefix-caching"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
model_cache:
  - repo_id: Qwen/Qwen2.5-3B-Instruct
    revision: aa8e72537993ba99e69dfaafa59ed015b17504d1
    use_volume: true
    volume_folder: qwen
resources:
  accelerator: L4
  use_gpu: true
runtime:
  predict_concurrency: 256
  health_checks:
    restart_check_delay_seconds: 300
    restart_threshold_seconds: 300
    stop_traffic_threshold_seconds: 120
environment_variables:
  hf_access_token: null
The base_image specifies the vLLM Docker image. The model_cache pre-downloads the model from Hugging Face and stores it on a cached volume. At startup, truss-transfer-cli loads the cached weights into /app/model_cache/qwen, then vLLM serves the model, with --served-model-name setting the model identifier exposed by the OpenAI-compatible API. The health_checks settings give the server time to load the weights before Baseten restarts it for failing health checks.
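To adapt this configuration for a larger model, you would typically change the model_cache entry, resources.accelerator, and the serve portion of the start_command. As a rough sketch (the model name and GPU count below are placeholders, not part of this deployment), a two-GPU variant might enable tensor parallelism:
vllm serve /app/model_cache/<volume_folder> --served-model-name <your-model-id> --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --enable-prefix-caching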

Deploy the model

Push the model to Baseten to start the deployment:
truss push qwen-2-5-3b-vllm --publish
You should see output like:
Deploying truss using L4:1x24 instance type.
Model Qwen 2.5 3B vLLM was successfully pushed.
View logs at https://app.baseten.co/models/XXXXXXX/logs/XXXXXXX
Copy the model URL from the output for the next step.

Call the model

Call the deployed model with the OpenAI client:
call_model.py
import os
from openai import OpenAI

model_url = "https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1"

client = OpenAI(
    base_url=model_url,
    api_key=os.environ.get("BASETEN_API_KEY"),
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Tongyi Qianwen mean?"}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Replace the model_url with the URL from your deployment output.
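If you prefer to smoke-test the deployment without Python, the same OpenAI-compatible route accepts a plain HTTP request. The example below assumes the same model URL pattern as model_url above and authenticates with the Api-key header:
curl "https://model-XXXXXXX.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "messages": [{"role": "user", "content": "What does Tongyi Qianwen mean?"}],
    "max_tokens": 128
  }'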