Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Pick the model you want to deploy. Each tab is a self-contained recipe.
Qwen/Qwen3.6-27B is a 27B-parameter dense model with up to 256K context.
This preset serves Qwen3.6-27B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and agent workflows.

Hardware: H100 × 4
Engine: vLLM 0.20.0
Context: 256K
Concurrency: 64

Write the config

Create and move into the project directory:
mkdir qwen3.6-27b-latency && cd qwen3.6-27b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.6-27b preset:latency"

model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.6-27B"
    messages:
      - role: user
        content: "What is the capital of France?"
    stream: true
    max_tokens: 512
    temperature: 1.0
    top_p: 0.95
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

weights:
  - source: "hf://Qwen/Qwen3.6-27B@main"
    mount_location: "/app/checkpoint/qwen3.6-27b"
    auth_secret_name: "hf_access_token"

resources:
  accelerator: H100:4
  use_gpu: true

runtime:
  predict_concurrency: 64

environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING

secrets:
  hf_access_token: null

docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.6-27b
    --served-model-name Qwen/Qwen3.6-27B
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --tensor-parallel-size 4
    --max-model-len 262144
    --language-model-only
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
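
Several values in this config have to agree with each other: the GPU count in accelerator with --tensor-parallel-size, and server_port with --port. The sketch below is one way to check that before pushing. It is not part of Truss; the helper names (vllm_tokens, flag_value, check_config) are hypothetical, the command string is abbreviated, and treating an accelerator with no ":N" suffix as a single GPU is an assumption.

```python
import shlex


def vllm_tokens(start_command: str) -> list[str]:
    """Split a start_command into tokens, unwrapping an `sh -c "..."` wrapper."""
    tokens = shlex.split(start_command)
    if tokens[:2] == ["sh", "-c"] and len(tokens) > 2:
        # The quoted vllm command arrives as a single token; split it again.
        tokens = shlex.split(tokens[2])
    return tokens


def flag_value(tokens: list[str], flag: str) -> str:
    """Return the token that follows `flag` (raises ValueError if absent)."""
    return tokens[tokens.index(flag) + 1]


def check_config(accelerator: str, server_port: int, start_command: str) -> list[str]:
    """Return a list of human-readable mismatches (empty when consistent)."""
    tokens = vllm_tokens(start_command)
    _, _, count = accelerator.partition(":")
    gpu_count = int(count) if count else 1  # assumption: no suffix means one GPU

    problems = []
    tp_size = int(flag_value(tokens, "--tensor-parallel-size"))
    if tp_size != gpu_count:
        problems.append(f"tensor-parallel-size {tp_size} != {gpu_count} GPUs")
    port = int(flag_value(tokens, "--port"))
    if port != server_port:
        problems.append(f"--port {port} != server_port {server_port}")
    return problems


# Abbreviated copy of the start_command above (the YAML `>-` scalar folds it
# into a single line, so it can be treated as one string).
COMMAND = (
    'sh -c "vllm serve /app/checkpoint/qwen3.6-27b '
    '--host 0.0.0.0 --port 8000 --tensor-parallel-size 4"'
)

print(check_config("H100:4", 8000, COMMAND))  # prints []
```

If you change the accelerator (for example to a different GPU count), a check like this flags the flag you also need to update.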

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
| --- | --- | --- |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --max-model-len | 262144 | Maximum context length (tokens) the server accepts per request. |
| --language-model-only | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --speculative_config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative_config.num_speculative_tokens | 2 | Number of tokens the draft speculator proposes per step. |
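
The --max-model-len value of 262144 is 256 × 1024 tokens. As a rough sketch of what that limit means for a request, the snippet below clamps max_tokens so that prompt plus completion stays inside the window. It assumes that prompt and generated tokens share the same context window; the helper names are hypothetical, and the prompt token count is something you would obtain from a tokenizer, not from this code.

```python
MAX_MODEL_LEN = 262144  # value passed to --max-model-len (256 * 1024 tokens)


def remaining_completion_budget(prompt_tokens: int,
                                max_model_len: int = MAX_MODEL_LEN) -> int:
    """Tokens left for generation once the prompt has been counted."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(max_model_len - prompt_tokens, 0)


def clamp_max_tokens(requested: int, prompt_tokens: int,
                     max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest max_tokens value that still fits in the context window."""
    return min(requested, remaining_completion_budget(prompt_tokens, max_model_len))


print(clamp_max_tokens(512, 1_000))    # 512: plenty of room left
print(clamp_max_tokens(512, 262_000))  # 144: only 144 tokens remain
```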

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.6-27b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
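
If you script your deployments, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the logs URL keeps the /models/{model_id}/... shape shown in the example output.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    # parts looks like ["models", "<model_id>", "logs", "<last segment>"]
    return parts[parts.index("models") + 1]


def base_url_for(model_id: str) -> str:
    """Fill {model_id} into the base URL pattern used when calling the model."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


model_id = model_id_from_logs_url(
    "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
)
print(model_id)                # abcd1234
print(base_url_for(model_id))
```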

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure the BASETEN_API_KEY environment variable is set. Now call your deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
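
The example_model_input in config.yaml sets stream: true, so you may also want to consume the response as a stream. The sketch below shows one way to do that with the OpenAI SDK's stream=True option. The helper names (join_stream, stream_answer) are hypothetical, and client is assumed to be the OpenAI client created in main.py. Chunks with no choices or with empty delta content are skipped.

```python
def join_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of streamed chat chunks."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        text = chunk.choices[0].delta.content
        if text:  # delta.content can be None or empty
            pieces.append(text)
    return "".join(pieces)


def stream_answer(client, prompt: str) -> str:
    """Send one user message with stream=True and return the joined text."""
    stream = client.chat.completions.create(
        model="Qwen/Qwen3.6-27B",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_stream(stream)
```

Call it as stream_answer(client, "What is machine learning?"). To print tokens as they arrive, print each text piece inside the loop instead of collecting them.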
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
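
Because reasoning_content is not part of the standard OpenAI response schema, it can be safer to read it defensively. This is a minimal sketch, assuming the field may be missing on some responses (for example when thinking is not enabled); split_reasoning is a hypothetical helper name.

```python
def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content
```

With the response above, use it as reasoning, answer = split_reasoning(response.choices[0].message) and check reasoning for None before printing it.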
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
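
Once the response carries tool_calls, your code is responsible for running them. The sketch below shows one possible dispatch loop: it looks up each called function by name, decodes the JSON-encoded arguments string, and builds the tool messages you would append to the conversation for a follow-up request. The get_weather implementation and the TOOL_REGISTRY mapping are hypothetical stand-ins, not part of the deployment.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather lookup."""
    return f"Sunny in {location}"


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each tool call and return one role="tool" message per call."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when no tool was called
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments is a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return results
```

With the response above, run_tool_calls(response.choices[0].message.tool_calls) returns the messages to send back, after the assistant message, in the next chat.completions.create call.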