Setup
To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- 27B
- 35B-A3B
Qwen/Qwen3.6-27B is a 27B-parameter dense model with up to 256K context.
This preset serves Qwen3.6-27B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and agent workflows.
Hardware: H100 × 4
Engine: vLLM 0.20.0
Context: 256K
Concurrency: 64
Write the config
Create and move into the project directory:
mkdir qwen3.6-27b-latency && cd qwen3.6-27b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.6-27b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.6-27B"
    messages:
      - role: user
        content: "What is the capital of France?"
    stream: true
    max_tokens: 512
    temperature: 1.0
    top_p: 0.95
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.20.0
weights:
  - source: "hf://Qwen/Qwen3.6-27B@main"
    mount_location: "/app/checkpoint/qwen3.6-27b"
    auth_secret_name: "hf_access_token"
resources:
  accelerator: H100:4
  use_gpu: true
runtime:
  predict_concurrency: 64
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
secrets:
  hf_access_token: null
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.6-27b
    --served-model-name Qwen/Qwen3.6-27B
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --tensor-parallel-size 4
    --max-model-len 262144
    --language-model-only
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --max-model-len | 262144 | Maximum context length (tokens) the server accepts per request. |
| --language-model-only | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --speculative_config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative_config.num_speculative_tokens | 2 | Number of tokens the draft speculator proposes per step. |
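To make the effect of --tensor-parallel-size concrete, here is a rough back-of-the-envelope sketch of weight memory per GPU. It assumes 2-byte (BF16) weights split evenly across GPUs, which is an assumption and not something this page states, and it ignores KV cache, activations, and the MTP head. The helper name weights_per_gpu_gb is made up for illustration.

```python
def weights_per_gpu_gb(num_params: float, bytes_per_param: float, tp_size: int) -> float:
    """Approximate weight memory per GPU in GB (10^9 bytes), assuming even sharding."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tp_size / 1e9


# Assumption: 27e9 parameters stored as 2-byte BF16 values, sharded across 4 GPUs.
per_gpu = weights_per_gpu_gb(27e9, 2, 4)
print(f"{per_gpu:.1f} GB of weights per GPU")
```

Under these assumptions the weights take a small share of each GPU's memory, and the rest is available for KV cache at long context lengths.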
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.6-27b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
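Both examples assume BASETEN_API_KEY is already exported. As a small sketch, the check below fails with a readable message instead of a bare KeyError when the variable is missing. The helper name require_api_key is hypothetical and not part of Truss or the OpenAI SDK.

```python
import os


def require_api_key(env=os.environ, name="BASETEN_API_KEY"):
    """Return the API key from env, or raise a clear error if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before calling the model.")
    return key


# Usage (with the variable exported): OpenAI(api_key=require_api_key(), base_url=...)
```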
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
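Printing tool_calls only shows what the model asked for; your own code still has to run the tool. Here is a minimal sketch of that step, assuming the standard OpenAI chat-completions shape where each call carries an id, a function name, and JSON-encoded arguments. The get_weather implementation and TOOL_REGISTRY are hypothetical, and a stand-in object replaces the real response.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation; a real one would query a weather service."""
    return f"Sunny in {location}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return tool-role messages for the next turn."""
    messages = []
    for call in tool_calls or []:
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return messages


# Stand-in for one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages)
```

To get a final answer, append the assistant message and these tool messages to the conversation and send it again.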
Qwen/Qwen3.6-35B-A3B is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context.
This variant ships in two presets tuned for different goals: Latency for the lowest time-to-first-token, and Throughput for the highest tokens per second. Pick the tab that matches your workload.
- Latency
- Throughput
This preset serves Qwen3.6-35B-A3B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.
Hardware: H100 × 4
Engine: vLLM 0.20.0
Context: 256K
Concurrency: 64
Write the config
Create and move into the project directory:
mkdir qwen3.6-35b-a3b-latency && cd qwen3.6-35b-a3b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.6-35b-a3b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.6-35B-A3B"
    messages:
      - role: user
        content: "What is the capital of France?"
    stream: true
    max_tokens: 512
    temperature: 1.0
    top_p: 0.95
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.20.0
weights:
  - source: "hf://Qwen/Qwen3.6-35B-A3B@main"
    mount_location: "/app/checkpoint/qwen3.6-35b-a3b"
    auth_secret_name: "hf_access_token"
resources:
  accelerator: H100:4
  use_gpu: true
runtime:
  predict_concurrency: 64
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING
secrets:
  hf_access_token: null
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.6-35b-a3b
    --served-model-name Qwen/Qwen3.6-35B-A3B
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --tensor-parallel-size 4
    --max-model-len 262144
    --language-model-only
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --max-model-len | 262144 | Maximum context length (tokens) the server accepts per request. |
| --language-model-only | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --speculative_config.method | mtp | Speculative decoding method. mtp: Multi-token prediction head speculation. |
| --speculative_config.num_speculative_tokens | 2 | Number of tokens the draft speculator proposes per step. |
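To give a feel for what num_speculative_tokens buys, the sketch below estimates the expected number of tokens emitted per verification step. It uses the simplified model from the speculative decoding literature in which every draft token is accepted independently with the same probability. The 0.7 acceptance rate is a hypothetical figure, not a measurement from this deployment.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens per target-model step: 1 + a + a^2 + ... + a^k.

    The step always yields one token; draft token i additionally counts only
    if all i drafts up to and including it were accepted (probability a^i).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


# Assumption: 70% of draft tokens are accepted; k = 2 matches the flag above.
print(round(expected_tokens_per_step(0.7, 2), 2))
```

Real acceptance rates vary with the prompt and sampling settings, so treat the result as an upper-level intuition and not a throughput prediction.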
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.6-35b-a3b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
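If you script your deployments, the model ID can be pulled out of the logs URL with a regular expression and substituted into the endpoint. This is a sketch based only on the URL shapes shown on this page; both helper names are hypothetical.

```python
import re


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    match = re.search(r"/models/([^/]+)/", url)
    if match is None:
        raise ValueError(f"no /models/<id>/ segment found in {url!r}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder of the endpoint pattern used on this page."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```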
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
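The preset's example_model_input sets stream: true, which suits a latency-tuned deployment because the first tokens reach the user before the full answer is done. The sketch below shows one way to accumulate streamed text. It runs on stand-in chunk objects; collect_stream and fake_chunk are hypothetical names.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate delta.content across streamed chunks, skipping empty deltas."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (for example usage-only ones) carry no choices
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in shaped like a streamed chat-completion chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Machine "), fake_chunk(None), fake_chunk("learning")]
print(collect_stream(demo))
```

Against the live deployment, pass stream=True to client.chat.completions.create and hand the returned iterator to collect_stream, or print each delta as it arrives.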
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
This preset serves the RedHatAI NVFP4 quantization of Qwen3.6-35B-A3B on a single B200, with FlashInfer MoE kernels, chunked prefill, and prefix caching enabled. It maximizes aggregate throughput at high concurrency.
Hardware: B200
Engine: vLLM (nightly build)
Context: 256K
Concurrency: 1000
Write the config
Create and move into the project directory:
mkdir qwen3.6-35b-a3b-throughput && cd qwen3.6-35b-a3b-throughput
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.6-35b-a3b preset:throughput"
model_metadata:
  example_model_input:
    model: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
  tags:
    - openai-compatible
    - vllm
    - qwen3.6
    - nvfp4
    - b200
base_image:
  image: vllm/vllm-openai:nightly
weights:
  - source: "hf://RedHatAI/Qwen3.6-35B-A3B-NVFP4@main"
    mount_location: "/app/model_cache/qwen3.6-35b-a3b-nvfp4"
    auth_secret_name: "hf_access_token"
build_commands: []
environment_variables:
  PYTORCH_ALLOC_CONF: "expandable_segments:True"
  VLLM_FLASHINFER_MOE_BACKEND: throughput
  VLLM_USE_FLASHINFER_MOE_FP4: 1
  VLLM_USE_FLASHINFER_MOE_FP8: 1
docker_server:
  start_command: >-
    vllm serve /app/model_cache/qwen3.6-35b-a3b-nvfp4
    --served-model-name RedHatAI/Qwen3.6-35B-A3B-NVFP4
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 262144
    --max-num-batched-tokens 32768
    --dtype auto
    --enable-chunked-prefill
    --enable-prefix-caching
    --max-num-seqs 512
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --moe_backend flashinfer_cutlass
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
    --trust-remote-code
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
runtime:
  predict_concurrency: 1000
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 1500
    stop_traffic_threshold_seconds: 120
resources:
  accelerator: B200
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 262144 | Maximum context length (tokens) the server accepts per request. |
| --max-num-batched-tokens | 32768 | Maximum total tokens processed per scheduler step. |
| --dtype | auto | Weight precision loaded at runtime. auto: Match the model’s checkpoint dtype (default). |
| --enable-chunked-prefill | (no value) | Process long prompts in chunks so decode requests keep running. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --max-num-seqs | 512 | Maximum number of concurrent sequences in the batch. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --moe_backend | flashinfer_cutlass | MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts. |
| --speculative-config | {"method":"qwen3_5_mtp","num_speculative_tokens":3} | Speculative decoding configuration as a JSON object. The dotted form (--speculative-config.method, --speculative-config.num_speculative_tokens, …) sets the same fields one at a time. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
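As a rough illustration of how --enable-chunked-prefill interacts with --max-num-batched-tokens, the sketch below counts the minimum number of scheduler steps needed to prefill one prompt. It assumes the whole per-step token budget goes to that prompt. In practice decode tokens from other requests share the budget, so the real count can be higher. The function name prefill_chunks is hypothetical.

```python
def prefill_chunks(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Lower bound on scheduler steps needed to prefill one prompt (ceiling division)."""
    if prompt_tokens <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("both arguments must be positive")
    return (prompt_tokens + max_num_batched_tokens - 1) // max_num_batched_tokens


# A prompt that fills the 262144-token context, with the 32768-token step budget above.
print(prefill_chunks(262144, 32768))  # 8
```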
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.6-35b-a3b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI

# Replace {model_id} in base_url with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
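A throughput preset only pays off when many requests are in flight at once. The sketch below fans prompts out over a thread pool and returns the answers in input order. The ask callable is a stand-in so the example runs without a deployment, and run_concurrently and the worker count are illustrative choices, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, ask, max_workers=32):
    """Send prompts in parallel; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))


def fake_ask(prompt):
    """Stand-in for a function that wraps client.chat.completions.create."""
    return prompt.upper()


answers = run_concurrently(["a", "b", "c"], fake_ask, max_workers=2)
print(answers)
```

For a real run, replace fake_ask with a function that sends one chat completion and returns response.choices[0].message.content.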
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
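Because this preset enables prefix caching, requests that begin with identical content are more likely to reuse cached KV entries. One way to encourage that, sketched here, is to keep a fixed system prompt as the first message of every request. The prompt text and build_messages are hypothetical, and actual cache hits depend on the chat template rendering the shared part to the same tokens.

```python
# Hypothetical shared instruction; keep it byte-for-byte identical across requests.
SYSTEM_PROMPT = "You are a concise support assistant."


def build_messages(user_text: str):
    """Put the shared system prompt first so every request starts with the same prefix."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]


first = build_messages("How do I reset my password?")
second = build_messages("Where is my invoice?")
print(first[0] == second[0])
```

Pass the returned list as the messages argument of client.chat.completions.create.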