
Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Pick the model you want to deploy. Each tab is a self-contained recipe.
Qwen/Qwen3.5-4B is a 4B-parameter dense model with up to 256K context. This preset serves Qwen3.5-4B with BF16 weights on a single H100, optimized for low time-to-first-token.

Hardware: H100 × 1
Engine: vLLM 0.18.0
Context: 32K
Concurrency: 128

Write the config

Create and move into the project directory:
mkdir qwen3.5-4b-latency && cd qwen3.5-4b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.5-4b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-4B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-4B@main"
    mount_location: "/app/checkpoint/qwen3.5-4b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-4b
    --served-model-name Qwen/Qwen3.5-4B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
| --- | --- | --- |
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
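
To get a rough feel for what --gpu-memory-utilization 0.95 and --dtype bfloat16 imply together, the arithmetic can be sketched as below. The 80 GB card size and the nominal 4 billion parameter count are assumptions for illustration, not values taken from the preset. The real split between weights, KV cache, and overhead depends on the exact checkpoint and on what vLLM measures at startup.

```python
# Back-of-the-envelope memory budget. Assumptions (not from the preset):
# an 80 GB H100 and a nominal 4e9 parameters.
GPU_MEMORY_GB = 80.0
GPU_MEMORY_UTILIZATION = 0.95  # value passed to --gpu-memory-utilization
PARAMS_BILLIONS = 4.0
BYTES_PER_PARAM = 2            # bfloat16 stores each parameter in 2 bytes


def memory_budget_gb(gpu_gb, utilization, params_billions, bytes_per_param):
    """Return (vLLM budget, weight footprint, remainder) in GB, with 1 GB = 1e9 bytes."""
    budget = gpu_gb * utilization
    weights = params_billions * bytes_per_param
    return budget, weights, budget - weights


budget, weights, remainder = memory_budget_gb(
    GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, PARAMS_BILLIONS, BYTES_PER_PARAM
)
print(f"vLLM budget:        {budget:.1f} GB")
print(f"BF16 weights:       {weights:.1f} GB")
print(f"Left for KV cache and overhead: {remainder:.1f} GB")
```

Under these assumptions the weights take only a small share of the budget, which is consistent with the preset leaving most of the card to the KV cache for 128 concurrent requests.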

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-4b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
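
If you script the deployment, you could pull the model ID out of the logs URL rather than copying it by hand. This is a minimal sketch that assumes the URL keeps the /models/{model_id}/logs/... shape shown above; model_id_from_logs_url is a hypothetical helper, not part of Truss.

```python
import re
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    path = urlparse(url).path
    match = re.match(r"^/models/([^/]+)", path)
    if match is None:
        raise ValueError(f"No model ID found in URL: {url!r}")
    return match.group(1)


example_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(model_id_from_logs_url(example_url))
```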

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
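
Two easy mistakes in main.py are leaving the {model_id} placeholder in the URL and forgetting to export BASETEN_API_KEY. The sketch below shows one way to catch both before any request is sent. baseten_base_url and require_api_key are hypothetical helper names. The URL pattern is copied from the example above.

```python
import os


def baseten_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a production deployment."""
    if not model_id or "{" in model_id or "}" in model_id:
        raise ValueError("Replace {model_id} with your real model ID")
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1"
    )


def require_api_key(env=os.environ) -> str:
    """Return BASETEN_API_KEY, or fail with a readable error if it is unset."""
    key = env.get("BASETEN_API_KEY")
    if not key:
        raise RuntimeError("BASETEN_API_KEY is not set")
    return key


print(baseten_base_url("abcd1234"))
```

You would then pass both values to the client: OpenAI(api_key=require_api_key(), base_url=baseten_base_url("abcd1234")).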
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
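
Because reasoning_content is an extra field added by the server rather than a declared attribute of the OpenAI SDK's message type, reading it directly can raise AttributeError when the field is missing, for example if thinking was not enabled. A defensive read might look like this sketch. split_reasoning is a hypothetical helper, and the SimpleNamespace objects only stand in for real response messages.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None if the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Stand-ins for response.choices[0].message, with and without reasoning.
with_thinking = SimpleNamespace(
    reasoning_content="Count the letters one by one.", content="3"
)
without_thinking = SimpleNamespace(content="3")

print(split_reasoning(with_thinking))
print(split_reasoning(without_thinking))
```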
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
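
The response only describes which tool the model wants to call; running it is up to your code. The sketch below shows one possible way to dispatch tool_calls to local functions and build the role "tool" messages used in the OpenAI chat format. The get_weather body, TOOL_REGISTRY, and run_tool_calls are hypothetical, and the SimpleNamespace object stands in for a real tool call.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    # Hypothetical stub; a real implementation would query a weather service.
    return f"Sunny in {location}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return one tool message per call."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        func = TOOL_REGISTRY.get(call.function.name)
        if func is None:
            result = f"Unknown tool: {call.function.name}"
        else:
            # Arguments arrive as a JSON-encoded string.
            args = json.loads(call.function.arguments)
            result = func(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": str(result)}
        )
    return messages


# Stand-in for an entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```

To get a final answer, you would typically append the assistant message and these tool messages to the conversation and call client.chat.completions.create again.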