Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- 4B
- 9B
- 35B
- 122B
Qwen/Qwen3.5-4B is a 4B-parameter dense model with up to 256K context. This preset serves Qwen3.5-4B with BF16 weights on a single H100, optimized for low time-to-first-token.
- Hardware: H100 × 1
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-4b-latency && cd qwen3.5-4b-latency
Then create a file named config.yaml and paste the following:
model_name: "model:qwen3.5-4b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-4B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-4B@main"
    mount_location: "/app/checkpoint/qwen3.5-4b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-4b
    --served-model-name Qwen/Qwen3.5-4B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-4b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
- Python (main.py):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
- cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
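Reading the model ID out of the logs URL can also be scripted. The following is a minimal sketch, not part of Truss or any Baseten SDK: `model_id_from_logs_url` and `base_url_for` are hypothetical helper names, and the URL shapes are assumed to match the examples on this page.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Assumed shape, taken from the example output: models/<model_id>/logs/<deployment_id>
    idx = parts.index("models")  # raises ValueError if there is no /models/ segment
    return parts[idx + 1]


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL used in the examples on this page."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(base_url_for(model_id))
```

The result of `base_url_for` can be passed as `base_url` to the `OpenAI` client in place of the hand-edited string.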
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
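Because reasoning_content is a server-side extension rather than a standard OpenAI field, client code may want to tolerate its absence. The sketch below assumes the field can be missing or None (for example, when thinking is disabled); `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` objects stand in for `response.choices[0].message` so the example runs offline.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat message, tolerating a missing field."""
    # Assumption: reasoning_content may be absent or None, so fall back to "".
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer


# Stand-ins for response.choices[0].message, so no deployment is needed to run this.
with_thinking = SimpleNamespace(reasoning_content="Count the letters...", content="3")
without_thinking = SimpleNamespace(content="3")

print(split_reasoning(with_thinking))
print(split_reasoning(without_thinking))
```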
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
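Printing tool_calls only shows what the model asked for; the application still has to run the tool. One possible way to dispatch the calls is sketched here. The `get_weather` implementation, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical, and the `SimpleNamespace` object mimics the shape of an OpenAI-format tool call so the example runs without a deployment.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f"Sunny in {location}"


# Map tool names (as declared in the tools array) to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each tool call and return (tool_call_id, result) pairs."""
    results = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        func = TOOL_REGISTRY[call.function.name]
        # In the OpenAI chat format, arguments arrive as a JSON-encoded string.
        kwargs = json.loads(call.function.arguments)
        results.append((call.id, func(**kwargs)))
    return results


# Stand-in for response.choices[0].message.tool_calls.
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )
]
print(run_tool_calls(fake_calls))
```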
Qwen/Qwen3.5-9B is a 9B-parameter dense model with up to 256K context. This preset serves Qwen3.5-9B with BF16 weights on a single H100. It’s the smallest dense Qwen3.5 deployment that keeps reasoning and tool calling enabled.
- Hardware: H100 × 1
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-9b-latency && cd qwen3.5-9b-latency
Then create a file named config.yaml and paste the following:
model_name: "model:qwen3.5-9b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-9B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-9B@main"
    mount_location: "/app/checkpoint/qwen3.5-9b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-9b
    --served-model-name Qwen/Qwen3.5-9B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:1
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
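The --max-model-len value caps how many tokens a single request may use. As a rough sketch, and assuming the cap covers prompt and generated tokens together, a client could clamp max_tokens before sending a request. `clamp_max_tokens` is a hypothetical helper, and the prompt token count is assumed to come from a tokenizer or from the usage field of an earlier response.

```python
MAX_MODEL_LEN = 32768  # value passed to --max-model-len in this preset


def clamp_max_tokens(prompt_tokens: int, requested: int, limit: int = MAX_MODEL_LEN) -> int:
    """Largest max_tokens <= requested that keeps prompt + completion within limit."""
    remaining = limit - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the configured context length")
    return min(requested, remaining)


# A long prompt leaves little room; a short prompt leaves the request unchanged.
print(clamp_max_tokens(prompt_tokens=30000, requested=4096))
print(clamp_max_tokens(prompt_tokens=1000, requested=4096))
```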
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-9b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
- Python (main.py):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
- cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
Qwen/Qwen3.5-35B-A3B is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context. This variant ships in two presets tuned for different goals: Latency for the lowest time-to-first-token, and Throughput for the highest tokens per second. Pick the tab that matches your workload.
- Latency
- Throughput
This preset serves Qwen3.5-35B with BF16 weights on two H100 GPUs (H100:2), optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.
- Hardware: H100 × 2
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-35b-latency && cd qwen3.5-35b-latency
Then create a file named config.yaml and paste the following:
model_name: "model:qwen3.5-35b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-35B-A3B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-35B-A3B@main"
    mount_location: "/app/checkpoint/qwen3.5-35b-a3b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-35b-a3b
    --served-model-name Qwen/Qwen3.5-35B-A3B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --tensor-parallel-size 2
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:2
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --tensor-parallel-size | 2 | Number of GPUs to shard the model across. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
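A back-of-the-envelope calculation helps explain why this preset shards across two GPUs. The sketch below is an approximation under stated assumptions: an 80 GB H100, weights that split evenly across tensor-parallel ranks, and no accounting for activations or framework overhead. `weights_gb_per_gpu` is a hypothetical helper. All 35B parameters of an MoE model must be resident in memory even though only 3B are active per token.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits
H100_MEMORY_GB = 80       # assumption: the 80 GB H100 variant


def weights_gb_per_gpu(params_billion: float, tensor_parallel: int) -> float:
    """Approximate BF16 weight footprint per GPU in GB (1 GB = 1e9 bytes)."""
    total_gb = params_billion * BYTES_PER_PARAM_BF16  # 1e9 params * bytes / 1e9
    return total_gb / tensor_parallel


for tp in (1, 2):
    per_gpu = weights_gb_per_gpu(35, tp)
    budget = H100_MEMORY_GB * 0.95  # matches --gpu-memory-utilization 0.95
    print(f"TP={tp}: {per_gpu:.0f} GB of weights per GPU, "
          f"{budget - per_gpu:.0f} GB left for KV cache")
```

Under these assumptions a single GPU would hold the weights with very little room left for KV cache, while two GPUs leave a much larger share of each card's budget for it.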
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-35b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
- Python (main.py):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
- cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
This preset serves Qwen3.5-35B FP8 on a single B200, with prefix caching and chunked prefill enabled. It maximizes aggregate throughput at high concurrency with minor quality impact from FP8.
- Hardware: B200
- Engine: vLLM 0.18.0
- Context: 256K
- Concurrency: 1000
Write the config
Create and move into the project directory:
mkdir qwen3.5-35b-throughput && cd qwen3.5-35b-throughput
Then create a file named config.yaml and paste the following:
########################################################
# Remove ( --language-model-only ) from the start command to turn on multimodal mode
########################################################
model_name: "model:qwen3.5-35b preset:throughput"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-35B-A3B-FP8"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-35B-A3B-FP8@main"
    mount_location: "/app/model_cache/qwen3.5-35b-a3b-fp8"
    auth_secret_name: "hf_access_token"
build_commands:
  - pip install --upgrade transformers
environment_variables:
  VLLM_USE_FLASHINFER_MOE_FP8: "0"
  PYTORCH_ALLOC_CONF: "expandable_segments:True"
docker_server:
  start_command: >-
    vllm serve /app/model_cache/qwen3.5-35b-a3b-fp8
    --served-model-name Qwen/Qwen3.5-35B-A3B-FP8
    --host 0.0.0.0
    --language-model-only
    --port 8000
    --gpu-memory-utilization 0.95
    --kv-cache-dtype fp8
    --reasoning-parser qwen3
    --enable-chunked-prefill
    --enable-prefix-caching
    --max-num-seqs 512
    --trust-remote-code
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
runtime:
  predict_concurrency: 1000
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 1500
    stop_traffic_threshold_seconds: 120
resources:
  accelerator: B200
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --language-model-only | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs. |
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --kv-cache-dtype | fp8 | KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-chunked-prefill | (no value) | Process long prompts in chunks so decode requests keep running. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --max-num-seqs | 512 | Maximum number of concurrent sequences in the batch. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
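The "~2×" density figure for --kv-cache-dtype fp8 follows from element size alone: FP8 uses one byte per cached value where BF16 uses two. The sketch below illustrates the arithmetic for standard attention layers. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.5's actual architecture, and `kv_bytes_per_token` is an illustrative helper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical dimensions for illustration only, not Qwen3.5's real architecture.
LAYERS, KV_HEADS, HEAD_DIM = 40, 4, 128

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
print(bf16, fp8, bf16 / fp8)
```

Halving the bytes per token means roughly twice as many cached tokens fit in the same memory, which is what allows more concurrent sequences or longer contexts on one GPU.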
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-35b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
- Python (main.py):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
- cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
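Since this preset enables prefix caching, requests that begin with identical content are the ones most likely to benefit. A minimal sketch of structuring requests that way follows. The system prompt and the `build_messages` helper are hypothetical; the only point is that the shared text comes first and the varying text comes last.

```python
SYSTEM_PROMPT = "You are a support assistant for Acme. Answer briefly."  # hypothetical


def build_messages(question: str) -> list[dict]:
    """Put the shared system prompt first so every request starts with the same content."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


batch = [build_messages(q) for q in ("How do I reset my password?", "Where is my invoice?")]
# Every request shares the leading system message; only the user turn differs.
print(all(messages[0] == batch[0][0] for messages in batch))
```

Each list in `batch` can be passed as `messages` to `client.chat.completions.create`. Placing per-request data such as timestamps or user IDs ahead of the shared prompt would shorten the common prefix.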
Qwen/Qwen3.5-122B-A10B is a 122B-parameter MoE model (10B active per token) with up to 256K context. This preset serves Qwen3.5-122B with BF16 weights on four H100 GPUs (H100:4). It keeps time-to-first-token low while fitting the full model on a single H100 node.
- Hardware: H100 × 4
- Engine: vLLM 0.18.0
- Context: 32K
- Concurrency: 128
Write the config
Create and move into the project directory:
mkdir qwen3.5-122b-latency && cd qwen3.5-122b-latency
Then create a file named config.yaml and paste the following:
model_name: "model:qwen3.5-122b preset:latency"
model_metadata:
  example_model_input:
    model: "Qwen/Qwen3.5-122B-A10B"
    messages:
      - role: user
        content: "What is the capital of France?"
    max_tokens: 100
    temperature: 0.7
base_image:
  image: vllm/vllm-openai:v0.18.0
weights:
  - source: "hf://Qwen/Qwen3.5-122B-A10B@main"
    mount_location: "/app/checkpoint/qwen3.5-122b-a10b"
    auth_secret_name: "hf_access_token"
build_commands: []
docker_server:
  start_command: >-
    sh -c "vllm serve /app/checkpoint/qwen3.5-122b-a10b
    --served-model-name Qwen/Qwen3.5-122B-A10B
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 32768
    --dtype bfloat16
    --tensor-parallel-size 4
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: '1'
  VLLM_LOGGING_LEVEL: WARNING
runtime:
  predict_concurrency: 128
resources:
  accelerator: H100:4
  use_gpu: true
secrets:
  hf_access_token: null
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:
| Flag | Value | What it does |
|---|---|---|
| --gpu-memory-utilization | 0.95 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --max-model-len | 32768 | Maximum context length (tokens) the server accepts per request. |
| --dtype | bfloat16 | Weight precision loaded at runtime. bfloat16: BF16 weights, no quantization. |
| --tensor-parallel-size | 4 | Number of GPUs to shard the model across. |
| --reasoning-parser | qwen3 | Server-side parser that separates reasoning output into reasoning_content. qwen3: Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --tool-call-parser | qwen3_coder | Server-side parser that emits structured tool_calls on the response. qwen3_coder: Qwen3-Coder tool format. |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3.5-122b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:
- Python (main.py):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)
print(response.choices[0].message.content)
- cURL:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.5-122B-A10B",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
To access the model’s chain of thought, enable thinking mode. The server parses the reasoning output into a separate reasoning_content field on the response:
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)  # final answer
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
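After a tool has run, its output is normally sent back so the model can write a final answer. The sketch below builds that follow-up message list using the OpenAI chat format, where the assistant turn carries the tool_calls and each result is a message with role "tool". `tool_result_messages` is a hypothetical helper, and the weather payload is placeholder data, not real output.

```python
import json


def tool_result_messages(history, assistant_tool_calls, results):
    """Append the assistant's tool-call turn and one tool message per result."""
    messages = list(history)  # copy so the caller's list is not mutated
    messages.append({"role": "assistant", "content": None, "tool_calls": assistant_tool_calls})
    for call_id, output in results:
        messages.append({"role": "tool", "tool_call_id": call_id, "content": json.dumps(output)})
    return messages


history = [{"role": "user", "content": "What's the weather in Paris?"}]
# Stand-in for the tool_calls the server returned, in plain-dict form.
calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Paris"}'},
}]
# Placeholder tool output for illustration.
followup = tool_result_messages(history, calls, [("call_1", {"temp_c": 18})])
print([m["role"] for m in followup])
```

Passing `followup` as `messages` in a second `client.chat.completions.create` call, with the same `tools` array, lets the model turn the tool output into a natural-language reply.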