Qwen3.6

Reasoning, Tool calling, Agentic, Long context

Setup

To get started, sign into Baseten with Truss and then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. Each tab is a self-contained recipe.

27B
35B-A3B

Qwen/Qwen3.6-27B is a 27B-parameter dense model with up to 256K context. This preset serves Qwen3.6-27B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and agent workflows.

Hardware

H100 × 4

Engine

vLLM 0.20.0

Context

256K

Concurrency

64

Write the config

Create and move into the project directory:

mkdir qwen3.6-27b-latency && cd qwen3.6-27b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:qwen3.6-27b preset:latency"

model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.6-27B"
    messages:
      - role: user
        content: "What is the capital of France?"
    stream: true
    max_tokens: 512
    temperature: 1.0
    top_p: 0.95
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

weights:
  - source: "hf://Qwen/Qwen3.6-27B@main"
    mount_location: "/app/checkpoint/qwen3.6-27b"
    auth_secret_name: "hf_access_token"

resources:
  accelerator: H100:4
  use_gpu: true

runtime:
  predict_concurrency: 64

environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING

secrets:
  hf_access_token: null

docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.6-27b
    --served-model-name Qwen/Qwen3.6-27B
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --tensor-parallel-size 4
    --max-model-len 262144
    --language-model-only
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--max-model-len`	`262144`	Maximum context length (tokens) the server accepts per request.
`--language-model-only`	(no value)	Disable the multimodal path; text-only serving. Remove to enable image/video inputs.
`--reasoning-parser`	`qwen3`	Server-side parser that separates reasoning output into `reasoning_content`. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--tool-call-parser`	`qwen3_coder`	Server-side parser that emits structured `tool_calls` on the response. qwen3_coder: Qwen3-Coder tool format.
`--speculative_config.method`	`mtp`	Speculative decoding method. mtp: Multi-token prediction head speculation.
`--speculative_config.num_speculative_tokens`	`2`	Number of tokens the draft speculator proposes per step.

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3.6-27b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
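As a minimal sketch of the step above, the model ID can be pulled out of the logs URL programmatically and turned into the base URL used in the next section. The helper names `model_id_from_logs_url` and `base_url_for` are hypothetical, and the URL shape is assumed to match the example output shown here.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    # Assumed path shape: /models/<model_id>/logs/<deployment_id>
    idx = parts.index("models")  # raises ValueError if the URL has no /models/ segment
    return parts[idx + 1]


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the OpenAI-compatible base URL."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```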

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
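Because `reasoning_content` is an extension field rather than part of the standard chat completion schema, it may be safer to read it defensively. The following is a sketch only: `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` object stands in for `response.choices[0].message` so the snippet runs without a deployment.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Falls back to an empty string when the server did not populate
    reasoning_content (for example, when thinking mode is off).
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning.strip(), answer.strip()


# Stand-in for response.choices[0].message (illustrative values, not real model output)
fake_message = SimpleNamespace(
    reasoning_content="Spell it out: s-t-r-a-w-b-e-r-r-y.\n",
    content="There are 3 r's.",
)

reasoning, answer = split_reasoning(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```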

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
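The response only describes which tool the model wants to call; running it is up to the client. Below is a rough sketch of that dispatch step, assuming the usual OpenAI-style shape in which each tool call carries an `id`, a function `name`, and `arguments` as a JSON string. The local `get_weather` implementation, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names for illustration.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    # Hypothetical local implementation; a real one would query a weather service.
    return f"Sunny in {location}"


# Maps tool names declared in the `tools` array to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each tool call and return the `tool` role messages to send back."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        fn = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": fn(**args)}
        )
    return messages


# Stand-in for response.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```

To continue the conversation, the returned messages would typically be appended after the assistant message that contained the tool calls and sent in a follow-up request.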

Qwen/Qwen3.6-35B-A3B is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context. This variant ships in two presets tuned for different goals: Latency for the lowest time-to-first-token, and Throughput for the highest tokens per second. Pick the tab that matches your workload.

Latency
Throughput

This preset serves Qwen3.6-35B-A3B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.

Hardware

H100 × 4

Engine

vLLM 0.20.0

Context

256K

Concurrency

64

Write the config

Create and move into the project directory:

mkdir qwen3.6-35b-a3b-latency && cd qwen3.6-35b-a3b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:qwen3.6-35b-a3b preset:latency"

model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.6-35B-A3B"
    messages:
      - role: user
        content: "What is the capital of France?"
    stream: true
    max_tokens: 512
    temperature: 1.0
    top_p: 0.95
  tags:
    - openai-compatible

base_image:
  image: vllm/vllm-openai:v0.20.0

weights:
  - source: "hf://Qwen/Qwen3.6-35B-A3B@main"
    mount_location: "/app/checkpoint/qwen3.6-35b-a3b"
    auth_secret_name: "hf_access_token"

resources:
  accelerator: H100:4
  use_gpu: true

runtime:
  predict_concurrency: 64

environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
  VLLM_LOGGING_LEVEL: WARNING

secrets:
  hf_access_token: null

docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.6-35b-a3b
    --served-model-name Qwen/Qwen3.6-35B-A3B
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --tensor-parallel-size 4
    --max-model-len 262144
    --language-model-only
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --speculative_config.method mtp
    --speculative_config.num_speculative_tokens 2"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--max-model-len`	`262144`	Maximum context length (tokens) the server accepts per request.
`--language-model-only`	(no value)	Disable the multimodal path; text-only serving. Remove to enable image/video inputs.
`--reasoning-parser`	`qwen3`	Server-side parser that separates reasoning output into `reasoning_content`. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--tool-call-parser`	`qwen3_coder`	Server-side parser that emits structured `tool_calls` on the response. qwen3_coder: Qwen3-Coder tool format.
`--speculative_config.method`	`mtp`	Speculative decoding method. mtp: Multi-token prediction head speculation.
`--speculative_config.num_speculative_tokens`	`2`	Number of tokens the draft speculator proposes per step.
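Two of these values are tied to other parts of the config: `--tensor-parallel-size` should match the GPU count in `resources.accelerator`, and `262144` is 256K expressed in tokens (256 × 1024). The sketch below checks both against a shortened copy of the start command. `parse_flags` and `gpu_count` are hypothetical helpers, not part of Truss or vLLM.

```python
import shlex


def parse_flags(command: str) -> dict:
    """Map each --flag to its value (True for flags that take no value)."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[tok] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # skip positional tokens such as `vllm serve <path>`
    return flags


def gpu_count(accelerator: str) -> int:
    """Parse an accelerator spec: 'H100:4' gives 4, a bare 'B200' gives 1."""
    _, _, count = accelerator.partition(":")
    return int(count) if count else 1


# Shortened copy of the start command from config.yaml
command = (
    "vllm serve /app/checkpoint/qwen3.6-35b-a3b "
    "--tensor-parallel-size 4 --max-model-len 262144 --language-model-only"
)
flags = parse_flags(command)

print(int(flags["--tensor-parallel-size"]) == gpu_count("H100:4"))
print(int(flags["--max-model-len"]) == 256 * 1024)
```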

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3.6-35b-a3b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
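The example input in config.yaml sets `stream: true`, which suits a latency-oriented preset because tokens can be shown as they arrive. A minimal sketch of consuming such a stream follows, assuming the standard OpenAI streaming chunk shape (`chunk.choices[0].delta.content`). `collect_stream` is a hypothetical helper, and the fake chunks stand in for a live stream so the snippet runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices (e.g. usage-only)
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


# With a live deployment, the chunks would come from something like:
#   stream = client.chat.completions.create(
#       model="Qwen/Qwen3.6-35B-A3B", messages=[...], stream=True)
#   print(collect_stream(stream))

# Offline stand-ins for streamed chunks (illustrative values)
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Par"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="is."))]),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_chunks))
```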

To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)

This preset serves the RedHatAI NVFP4 quantization of Qwen3.6-35B-A3B on a single B200, with FlashInfer MoE kernels, chunked prefill, and prefix caching enabled. It maximizes aggregate throughput at high concurrency.

Hardware

B200

Engine

vLLM (nightly build)

Context

256K

Concurrency

1000

Write the config

Create and move into the project directory:

mkdir qwen3.6-35b-a3b-throughput && cd qwen3.6-35b-a3b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:qwen3.6-35b-a3b preset:throughput"
model_metadata:
  example_model_input:
    model: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
  tags:
    - openai-compatible
    - vllm
    - qwen3.6
    - nvfp4
    - b200
base_image:
  image: vllm/vllm-openai:nightly
weights:
  - source: "hf://RedHatAI/Qwen3.6-35B-A3B-NVFP4@main"
    mount_location: "/app/model_cache/qwen3.6-35b-a3b-nvfp4"
    auth_secret_name: "hf_access_token"
build_commands: []
environment_variables:
  PYTORCH_ALLOC_CONF: "expandable_segments:True"
  VLLM_FLASHINFER_MOE_BACKEND: throughput
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
  VLLM_USE_FLASHINFER_MOE_FP8: "1"
docker_server:
  start_command: >-
    vllm serve /app/model_cache/qwen3.6-35b-a3b-nvfp4
    --served-model-name RedHatAI/Qwen3.6-35B-A3B-NVFP4
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 262144
    --max-num-batched-tokens 32768
    --dtype auto
    --enable-chunked-prefill
    --enable-prefix-caching
    --max-num-seqs 512
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --moe_backend flashinfer_cutlass
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
    --trust-remote-code
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
runtime:
  predict_concurrency: 1000
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 1500
    stop_traffic_threshold_seconds: 120
resources:
  accelerator: B200
  use_gpu: true
secrets:
  hf_access_token: null

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--gpu-memory-utilization`	`0.95`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--max-model-len`	`262144`	Maximum context length (tokens) the server accepts per request.
`--max-num-batched-tokens`	`32768`	Maximum total tokens processed per scheduler step.
`--dtype`	`auto`	Weight precision loaded at runtime. auto: Match the model’s checkpoint dtype (default).
`--enable-chunked-prefill`	(no value)	Process long prompts in chunks so decode requests keep running.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--max-num-seqs`	`512`	Maximum number of concurrent sequences in the batch.
`--reasoning-parser`	`qwen3`	Server-side parser that separates reasoning output into `reasoning_content`. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6).
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--tool-call-parser`	`qwen3_coder`	Server-side parser that emits structured `tool_calls` on the response. qwen3_coder: Qwen3-Coder tool format.
`--moe_backend`	`flashinfer_cutlass`	MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts.
`--speculative-config`	`{"method":"qwen3_5_mtp","num_speculative_tokens":3}`	Speculative decoding configuration as a JSON object. The dotted form (`--speculative-config.method`, `--speculative-config.num_speculative_tokens`, …) sets the same fields one at a time.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
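To illustrate the two spellings of the speculative decoding config described in the table, this sketch renders the same settings both as a single JSON-valued flag and as dotted flags. The helper names `as_json_flag` and `as_dotted_flags` are hypothetical; only the flag spellings themselves come from the table above.

```python
import json
import shlex


def as_json_flag(name: str, config: dict) -> str:
    """Render a config dict as one shell-quoted, JSON-valued flag."""
    payload = json.dumps(config, separators=(",", ":"))
    return f"--{name} {shlex.quote(payload)}"


def as_dotted_flags(name: str, config: dict) -> list[str]:
    """Render the same config as one dotted flag per field."""
    return [f"--{name}.{key} {value}" for key, value in config.items()]


spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": 3}

print(as_json_flag("speculative-config", spec))
for flag in as_dotted_flags("speculative-config", spec):
    print(flag)
```

The JSON form keeps the whole configuration in one argument, which is convenient when it is generated by a script; the dotted form is easier to read and diff in a hand-written start command.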

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3.6-35b-a3b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
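Since this preset is tuned for aggregate throughput, a client usually needs many requests in flight to benefit from it. One possible pattern, sketched here under assumptions, bounds the number of concurrent requests with a semaphore. `run_bounded` is a hypothetical helper, and `fake_send` stands in for an async call such as `AsyncOpenAI(...).chat.completions.create(...)` so the snippet runs offline.

```python
import asyncio


async def run_bounded(prompts, send, limit: int = 64):
    """Send every prompt with at most `limit` requests in flight.

    Results are returned in the same order as `prompts`.
    """
    sem = asyncio.Semaphore(limit)

    async def one(prompt):
        async with sem:
            return await send(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_send(prompt: str) -> str:
    # Stand-in for a real request to the deployment.
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(run_bounded(["a", "b", "c"], fake_send, limit=2))
print(results)
```

The right `limit` depends on the workload; the config caps the server at `--max-num-seqs 512` concurrent sequences and `predict_concurrency: 1000`, so a client-side limit above those values is unlikely to help.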

To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:

response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
