Some Model APIs support extended thinking, where the model reasons through a problem before producing a final answer.
The reasoning process generates additional tokens that appear in a separate reasoning_content field, distinct from the final response.
Supported models
| Model | Slug | Reasoning |
|---|---|---|
| DeepSeek V3.1 | deepseek-ai/DeepSeek-V3.1 | Enabled by default |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Enabled by default |
| Minimax M2.5 | MiniMaxAI/MiniMax-M2.5 | Enabled by default |
| Nemotron Super | nvidia/Nemotron-120B-A12B | Enabled by default |
| OpenAI GPT OSS 120B | openai/gpt-oss-120b | Enabled by default |
| Kimi K2.5 | moonshotai/Kimi-K2.5 | Opt-in via chat_template_args |
| Kimi K2.6 | moonshotai/Kimi-K2.6 | Opt-in via chat_template_args |
| GLM 4.7 | zai-org/GLM-4.7 | Opt-in via chat_template_args |
| GLM 5 | zai-org/GLM-5 | Opt-in via chat_template_args |
DeepSeek V4 Pro and GPT OSS 120B also support reasoning_effort.
Models not listed here don’t support reasoning.
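The table above can be encoded as a small lookup that adds the opt-in flag only where it is needed. This is a minimal sketch: `OPT_IN_MODELS`, `DEFAULT_ON_MODELS`, and `thinking_extra_body` are hypothetical names, not part of the Baseten or OpenAI SDKs. Only the slugs and the `chat_template_args` payload come from this page.

```python
# Slugs copied from the table above; the helper itself is illustrative.
OPT_IN_MODELS = {
    "moonshotai/Kimi-K2.5",
    "moonshotai/Kimi-K2.6",
    "zai-org/GLM-4.7",
    "zai-org/GLM-5",
}
DEFAULT_ON_MODELS = {
    "deepseek-ai/DeepSeek-V3.1",
    "deepseek-ai/DeepSeek-V4-Pro",
    "MiniMaxAI/MiniMax-M2.5",
    "nvidia/Nemotron-120B-A12B",
    "openai/gpt-oss-120b",
}


def thinking_extra_body(model: str) -> dict:
    """Return the extra request fields needed to get reasoning from `model`."""
    if model in OPT_IN_MODELS:
        # Opt-in models need the chat template flag.
        return {"chat_template_args": {"enable_thinking": True}}
    if model in DEFAULT_ON_MODELS:
        # Reasoning is already on; nothing extra to send.
        return {}
    raise ValueError(f"{model} is not listed as supporting reasoning")
```

With the OpenAI Python client, the returned dictionary would be passed as `extra_body` alongside `model` and `messages`.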
Enable thinking
Enable thinking for Kimi K2.5, Kimi K2.6, GLM 4.7, and GLM 5 by passing chat_template_args with enable_thinking set to true.
Pass chat_template_args through extra_body since it extends the standard OpenAI API:
from openai import OpenAI
import os
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
    max_tokens=4096,
    stream=True,
)
Include chat_template_args directly in the request options:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2.5",
  messages: [{ role: "user", content: "What is the sum of the first 100 prime numbers?" }],
  chat_template_args: { enable_thinking: true },
  max_tokens: 4096,
  stream: true,
});
Include chat_template_args in the JSON request body:
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "chat_template_args": {"enable_thinking": true},
    "max_tokens": 4096,
    "stream": false
  }'
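The Python and JavaScript requests above set stream to true but do not show how to consume the stream. The sketch below assumes that each streamed chunk carries reasoning text in `delta.reasoning_content`, mirroring the non-streaming message field. This page does not confirm that shape, so inspect a real streamed response before relying on it. `collect_stream` is a hypothetical helper.

```python
def collect_stream(chunks):
    """Accumulate reasoning and answer text from an iterable of chat chunks.

    Assumption: each chunk looks like an OpenAI chat completion chunk, and
    reasoning text (if any) arrives on `delta.reasoning_content`.
    """
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        # getattr guards against chunks that do not expose the extra field.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)
```

Given the streaming `response` from the Python request, `reasoning, answer = collect_stream(response)` would return both strings once the stream ends.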
Control reasoning depth
The reasoning_effort parameter controls how thoroughly the model reasons through a problem.
DeepSeek V4 Pro and GPT OSS 120B support this parameter.
| Value | Behavior |
|---|---|
| low | Faster responses, less thorough reasoning |
| medium | Balanced (default) |
| high | Slower responses, more thorough reasoning |
| xhigh | Maximum reasoning depth, highest token cost (DeepSeek V4 Pro only) |
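Because xhigh is limited to DeepSeek V4 Pro, it can help to check the value before sending a request. The following is a minimal sketch built only from the table above; `EFFORT_LEVELS` and `effort_extra_body` are hypothetical names, and this page does not say how the API itself responds to an unsupported value.

```python
# Values and model support copied from the table above.
EFFORT_LEVELS = {
    "deepseek-ai/DeepSeek-V4-Pro": {"low", "medium", "high", "xhigh"},
    "openai/gpt-oss-120b": {"low", "medium", "high"},
}


def effort_extra_body(model: str, effort: str = "medium") -> dict:
    """Validate `effort` for `model` and return the request field to send."""
    allowed = EFFORT_LEVELS.get(model)
    if allowed is None:
        raise ValueError(f"{model} is not listed as supporting reasoning_effort")
    if effort not in allowed:
        raise ValueError(
            f"{effort!r} is not valid for {model}; choose from {sorted(allowed)}"
        )
    return {"reasoning_effort": effort}
```

The default of medium matches the documented default, so calling the helper without an effort makes the default explicit in the request.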
The examples below use DeepSeek V4 Pro first, then GPT OSS 120B.
Pass reasoning_effort through extra_body since it extends the standard OpenAI API:
from openai import OpenAI
import os
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"}
)
print(response.choices[0].message.content)
Include reasoning_effort directly in the request options:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V4-Pro",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high"
});
console.log(response.choices[0].message.content);
Include reasoning_effort in the JSON request body:
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
For GPT OSS 120B, pass reasoning_effort through extra_body in the same way:
from openai import OpenAI
import os
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"}
)
print(response.choices[0].message.content)
Include reasoning_effort directly in the request options:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
  model: "openai/gpt-oss-120b",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high"
});
console.log(response.choices[0].message.content);
Include reasoning_effort in the JSON request body:
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
Reasoning improves quality for tasks that benefit from step-by-step thinking: mathematical calculations, multi-step logic problems, code generation with complex requirements, and analysis requiring multiple considerations.
For straightforward tasks like simple Q&A or text generation, reasoning adds latency and token cost without improving quality.
In these cases, use a model without reasoning support or set reasoning_effort to low.
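One way to apply this guidance is to pick the effort level from a coarse task category. The sketch below is illustrative only: the category labels and `pick_effort` are hypothetical, and choosing high for reasoning-heavy work is an assumption, since medium is the documented default.

```python
# Hypothetical task labels based on the task types named above.
REASONING_HEAVY = {"math", "multi_step_logic", "complex_codegen", "analysis"}


def pick_effort(task_kind: str) -> str:
    """Return a reasoning_effort value for a coarse task category."""
    # Assumption: use "high" where step-by-step thinking helps, "low" otherwise.
    return "high" if task_kind in REASONING_HEAVY else "low"
```

The result only applies to models that accept reasoning_effort. For other models, the equivalent choice is whether to use a reasoning model at all.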
Parse the response
The model’s thinking process appears in reasoning_content, separate from the final answer in content.
Both fields are returned on the message object.
Read reasoning_content and content directly off the message object:
from openai import OpenAI
import os
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Is 91 a prime number? Answer in one sentence."}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
)
message = response.choices[0].message
print("Reasoning:", message.reasoning_content)
print("Answer:", message.content)
Read reasoning_content and content from the returned message:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});
const response = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2.6",
  messages: [{ role: "user", content: "Is 91 a prime number? Answer in one sentence." }],
  chat_template_args: { enable_thinking: true },
});
const message = response.choices[0].message;
console.log("Reasoning:", message.reasoning_content);
console.log("Answer:", message.content);
Pipe the response through jq to extract each field:
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.6",
    "messages": [{"role": "user", "content": "Is 91 a prime number? Answer in one sentence."}],
    "chat_template_args": {"enable_thinking": true}
  }' | jq '.choices[0].message | {reasoning: .reasoning_content, answer: .content}'
The response body contains both fields on the assistant message:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking whether 91 is a prime number... 91 = 7 × 13, so it is not prime...",
        "content": "No, 91 is not a prime number because it can be factored as $7 \\times 13$."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 203,
    "total_tokens": 224
  }
}
Reasoning tokens are included in completion_tokens and count toward your total usage and billing.
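When working with the raw JSON body rather than an SDK object, the same fields can be pulled out with the standard library. This is a minimal sketch: `split_response` is a hypothetical helper, and treating a missing or null reasoning_content as an empty string is an assumption about responses where the model did not reason.

```python
import json


def split_response(body: str) -> dict:
    """Extract reasoning, answer, and token counts from a chat completion body."""
    data = json.loads(body)
    message = data["choices"][0]["message"]
    usage = data.get("usage", {})
    return {
        # Assumption: the field may be absent or null when no reasoning ran.
        "reasoning": message.get("reasoning_content") or "",
        "answer": message.get("content") or "",
        # completion_tokens covers reasoning and answer tokens together.
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }
```

In the sample response above, completion_tokens is 203 and prompt_tokens is 21, which sum to the total_tokens value of 224. The sample usage block does not break reasoning tokens out separately, so the sketch reports only the combined figure.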