Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- 4B
- 9B
- 35B
- 122B
Qwen/Qwen3.5-4B is a 4B-parameter dense model with up to 256K context.
This preset serves Qwen3.5-4B with BF16 weights on a single H100, optimized for low time-to-first-token.
- Hardware: H100 × 1
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-4b-latency && cd qwen3.5-4b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.5-4b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-4B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-4B@main"
    mount_location: "/app/checkpoint/qwen3.5-4b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-4b
    --served-model-name Qwen/Qwen3.5-4B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
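For a rough sense of what --gpu-memory-utilization 0.95 leaves for the KV cache once the weights are loaded, the arithmetic can be sketched as below. The 80 GB H100 capacity and the round 4-billion parameter count are assumptions for illustration, and the estimate ignores activations and runtime overhead, so treat the result as an upper bound rather than a measurement.

```python
def kv_cache_budget_gb(params_billion, bytes_per_param, gpu_gb, num_gpus, utilization):
    """Rough memory left for KV cache after weights, in GB (ignores activations and overhead)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    usable_gb = gpu_gb * num_gpus * utilization    # share of GPU memory vLLM may claim
    return usable_gb - weights_gb


# Assumed values: 4B parameters, BF16 (2 bytes each), one 80 GB H100, utilization 0.95.
budget = kv_cache_budget_gb(4, 2, 80, 1, 0.95)
print(f"{budget:.1f} GB left for KV cache")
```

Lowering the utilization value shrinks this budget, which in turn limits how many concurrent sequences fit at the configured context length.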
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-4b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
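The base_url in the Python example is a plain string, so {model_id} stays a literal placeholder until you replace it. As a minimal sketch, the two helpers below (hypothetical names, not part of Truss or the OpenAI SDK) pull the model ID out of a logs URL of the shape shown in the push output and fill it into the URL pattern used above.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    return parts[parts.index("models") + 1]


def base_url_for(model_id: str) -> str:
    """Fill the {model_id} placeholder in the base URL pattern from the examples."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"  # example URL from the push output
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```

The returned string can be passed as base_url when constructing the OpenAI client.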
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
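Because reasoning_content is an extension to the standard OpenAI message shape, client code may want to tolerate responses where it is missing, for example when thinking mode is off. The sketch below assumes that case shows up as an absent or None attribute, which is an assumption about server and SDK behavior rather than something this guide confirms; the stand-in objects only exist so the snippet runs offline.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Stand-ins for response.choices[0].message (hypothetical values).
with_thinking = SimpleNamespace(reasoning_content="Count the r's: 3.", content="3")
without_thinking = SimpleNamespace(content="3")

print(split_reasoning(with_thinking))
print(split_reasoning(without_thinking))
```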
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
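The tool_calls printed above only describe what the model wants to run; executing the function is up to the caller. A minimal sketch of that dispatch step follows. The get_weather body, the TOOL_REGISTRY name, and the stand-in call object are all hypothetical; a real program would pass response.choices[0].message.tool_calls instead.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {location}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return (tool_call_id, result) pairs."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        func = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append((call.id, func(**kwargs)))
    return results


# Stand-in for response.choices[0].message.tool_calls so the sketch runs offline.
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )
]
print(run_tool_calls(fake_calls))
```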
Qwen/Qwen3.5-9B is a 9B-parameter dense model with up to 256K context.
This preset serves Qwen3.5-9B with BF16 weights on a single H100. It’s the smallest dense Qwen3.5 deployment that keeps reasoning and tool calling enabled.
- Hardware: H100 × 1
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-9b-latency && cd qwen3.5-9b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.5-9b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-9B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-9B@main"
    mount_location: "/app/checkpoint/qwen3.5-9b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-9b
    --served-model-name Qwen/Qwen3.5-9B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
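The --max-model-len limit covers the prompt and the generated tokens together, so a long prompt leaves less room for max_tokens. A small sketch of clamping the request to fit is shown below; the clamp_max_tokens helper is hypothetical, and the prompt token count is assumed to come from a tokenizer or from the usage field of an earlier response.

```python
MAX_MODEL_LEN = 32768  # matches --max-model-len in this preset


def clamp_max_tokens(prompt_tokens: int, requested: int, max_model_len: int = MAX_MODEL_LEN) -> int:
    """Cap max_tokens so prompt plus completion stays within the context window."""
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested, remaining)


print(clamp_max_tokens(prompt_tokens=30000, requested=4096))
print(clamp_max_tokens(prompt_tokens=1000, requested=4096))
```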
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-9b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
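Once a tool has run, the usual OpenAI-style pattern is to send its result back in a second request so the model can produce a final answer. The sketch below only builds that follow-up message list; the helper name, the call ID, and the weather string are hypothetical, and it assumes the deployment accepts the standard assistant and tool message roles.

```python
import json


def tool_followup_messages(messages, tool_call_id, name, arguments_json, result):
    """Return a new history with the assistant tool call and its result appended."""
    assistant_turn = {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": tool_call_id,
            "type": "function",
            "function": {"name": name, "arguments": arguments_json},
        }],
    }
    tool_turn = {"role": "tool", "tool_call_id": tool_call_id, "content": result}
    return messages + [assistant_turn, tool_turn]  # original list is left unchanged


history = [{"role": "user", "content": "What's the weather in Paris?"}]
followup = tool_followup_messages(
    history, "call_1", "get_weather", json.dumps({"location": "Paris"}), "Sunny, 21 C"
)
print(json.dumps(followup, indent=2))
```

Passing followup as messages, together with the same tools array, to client.chat.completions.create would then ask the model for its final reply.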
Qwen/Qwen3.5-35B-A3B is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context.
This variant ships in 2 presets tuned for different goals: Latency for lowest time-to-first-token, and Throughput for highest tokens per second. Pick the preset that matches your workload.
- Latency
- Throughput
Latency
This preset serves Qwen3.5-35B with BF16 weights on two H100 GPUs (H100:2), optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.
- Hardware: H100 × 2
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-35b-latency && cd qwen3.5-35b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.5-35b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-35B-A3B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-35B-A3B@main"
    mount_location: "/app/checkpoint/qwen3.5-35b-a3b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-35b-a3b
    --served-model-name Qwen/Qwen3.5-35B-A3B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --tensor-parallel-size 2
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --tensor-parallel-size | 2 | Number of GPUs to shard the model across. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
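In this preset the --tensor-parallel-size value matches the GPU count in resources.accelerator (H100:2), and editing one without the other is an easy mistake. A hedged sketch of a pre-push sanity check follows; the function name is hypothetical, it works on plain strings rather than parsing config.yaml, and it assumes the two values are meant to be equal, which holds here but not for every parallelism layout.

```python
import re


def check_tensor_parallel(accelerator: str, start_command: str) -> bool:
    """True when --tensor-parallel-size equals the GPU count in accelerator (e.g. 'H100:2')."""
    gpu_count = int(accelerator.split(":")[1]) if ":" in accelerator else 1
    match = re.search(r"--tensor-parallel-size[ =](\d+)", start_command)
    tp_size = int(match.group(1)) if match else 1  # treat an omitted flag as size 1
    return tp_size == gpu_count


cmd = "vllm serve /app/checkpoint/qwen3.5-35b-a3b --tensor-parallel-size 2 --dtype bfloat16"
print(check_tensor_parallel("H100:2", cmd))
print(check_tensor_parallel("H100:4", cmd))
```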
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-35b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
Throughput
This preset serves Qwen3.5-35B FP8 on a single B200, with prefix caching and chunked prefill enabled. It maximizes aggregate throughput at high concurrency with minor quality impact from FP8.
- Hardware: B200
- Engine: vLLM 0.18.0
- Context: 256K
- Concurrency: 1000
Write the config
Create and move into the project directory:
mkdir qwen3.5-35b-throughput && cd qwen3.5-35b-throughput
Then create a file named config.yaml and paste the following:
config.yaml
########################################################
# Remove ( --language-model-only ) from the start command to turn on multimodal mode
########################################################
model_name: "model:qwen3.5-35b preset:throughput"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-35B-A3B-FP8"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-35B-A3B-FP8@main"
    mount_location: "/app/model_cache/qwen3.5-35b-a3b-fp8"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers
environment_variables:
  VLLM_USE_FLASHINFER_MOE_FP8: "0"
  PYTORCH_ALLOC_CONF: "expandable_segments:True"
docker_server:
  start_command: >-
    vllm serve /app/model_cache/qwen3.5-35b-a3b-fp8
    --served-model-name Qwen/Qwen3.5-35B-A3B-FP8
    --host 0.0.0.0
    --language-model-only
    --port 8000
    --gpu-memory-utilization 0.95
    --kv-cache-dtype fp8
    --reasoning-parser qwen3
    --enable-chunked-prefill
    --enable-prefix-caching
    --max-num-seqs 512
    --trust-remote-code
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
runtime:
  predict_concurrency: 1000
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 1500
    stop_traffic_threshold_seconds: 120
resources:
  accelerator: B200
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --language-model-only | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs. |
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --kv-cache-dtype | fp8 | KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-chunked-prefill | (no value) | Process long prompts in chunks so decode requests keep running. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --max-num-seqs | 512 | Maximum number of concurrent sequences in the batch. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
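Prefix caching only helps when requests actually begin with the same tokens, so it pays to put shared instructions first and the per-request text last. The sketch below illustrates that request shape; the system prompt, the ExampleCo name, and the build_messages helper are hypothetical, and how much cache reuse you see in practice depends on the chat template and the workload.

```python
# Hypothetical shared instructions that every request starts with.
SHARED_SYSTEM_PROMPT = "You are a support assistant for ExampleCo. Answer briefly."


def build_messages(user_question: str) -> list[dict]:
    """Put the unchanging text first so requests share an identical leading prefix."""
    return [
        {"role": "system", "content": SHARED_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]


batch = [build_messages(q) for q in ["How do I reset my password?", "Where is my invoice?"]]
print(batch[0][0] == batch[1][0])  # the leading system message is identical across requests
```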
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-35b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
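A throughput-oriented deployment only pays off when the client keeps many requests in flight at once. As a minimal sketch, the helper below fans a list of prompts out over a thread pool with a cap on concurrency. The run_batch and fake_send names are hypothetical, and fake_send stands in for a function that would call client.chat.completions.create; the cap of 64 is an arbitrary starting point, not a recommendation from this guide.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(send, prompts, max_in_flight=64):
    """Send prompts with at most max_in_flight concurrent calls; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a function that sends one chat completion request."""
    return prompt.upper()


print(run_batch(fake_send, ["a", "b", "c"], max_in_flight=2))
```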
Qwen/Qwen3.5-122B-A10B is a 122B-parameter MoE model (10B active per token) with up to 256K context.
This preset serves Qwen3.5-122B with BF16 weights on four H100 GPUs (H100:4). It keeps time-to-first-token low while fitting the full model on a single H100 node.
- Hardware: H100 × 4
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-122b-latency && cd qwen3.5-122b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:qwen3.5-122b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-122B-A10B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-122B-A10B@main"
    mount_location: "/app/checkpoint/qwen3.5-122b-a10b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-122b-a10b
    --served-model-name Qwen/Qwen3.5-122B-A10B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --tensor-parallel-size 4
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:4
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
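A quick calculation shows why this preset shards across four GPUs. The sketch below estimates the minimum GPU count needed just to hold the weights, assuming 80 GB per H100 and a round 122 billion parameters at 2 bytes each; both figures are assumptions for illustration, and the estimate leaves out KV cache and activations, so real deployments need headroom beyond it.

```python
import math


def min_gpus_for_weights(params_billion, bytes_per_param, gpu_gb, utilization):
    """Smallest GPU count whose usable memory holds the weights alone."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    usable_per_gpu_gb = gpu_gb * utilization
    return math.ceil(weights_gb / usable_per_gpu_gb)


# Assumed values: 122B parameters, BF16 (2 bytes each), 80 GB H100, utilization 0.95.
print(min_gpus_for_weights(122, 2, 80, 0.95))
```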
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-122b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
- Python
- cURL
main.py
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-122B-A10B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)