Reasoning

Some Model APIs support extended thinking, where the model reasons through a problem before producing a final answer. The reasoning process generates additional tokens that appear in a separate reasoning_content field, distinct from the final response.

Supported models

Model	Slug	Reasoning
DeepSeek V3.1	`deepseek-ai/DeepSeek-V3.1`	Enabled by default
DeepSeek V4 Pro	`deepseek-ai/DeepSeek-V4-Pro`	Enabled by default
Minimax M2.5	`MiniMaxAI/MiniMax-M2.5`	Enabled by default
Nemotron Super	`nvidia/Nemotron-120B-A12B`	Enabled by default
OpenAI GPT OSS 120B	`openai/gpt-oss-120b`	Enabled by default
Kimi K2.5	`moonshotai/Kimi-K2.5`	Opt-in via `chat_template_args`
Kimi K2.6	`moonshotai/Kimi-K2.6`	Opt-in via `chat_template_args`
GLM 4.7	`zai-org/GLM-4.7`	Opt-in via `chat_template_args`
GLM 5	`zai-org/GLM-5`	Opt-in via `chat_template_args`

DeepSeek V4 Pro and GPT OSS 120B also support reasoning_effort. Models not listed here don’t support reasoning.
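The table above determines which extra request fields a given model needs. The sketch below turns it into a lookup. The `REASONING_MODELS` mapping and the `reasoning_extra_body` helper are hypothetical names, and the mapping is transcribed from the table rather than queried from the API. Models marked "Enabled by default" need no extra fields. Opt-in models get `chat_template_args`.

```python
# Reasoning mode per model slug, transcribed from the supported-models table.
# "default" means reasoning is on by default; "opt_in" means it must be enabled.
REASONING_MODELS = {
    "deepseek-ai/DeepSeek-V3.1": "default",
    "deepseek-ai/DeepSeek-V4-Pro": "default",
    "MiniMaxAI/MiniMax-M2.5": "default",
    "nvidia/Nemotron-120B-A12B": "default",
    "openai/gpt-oss-120b": "default",
    "moonshotai/Kimi-K2.5": "opt_in",
    "moonshotai/Kimi-K2.6": "opt_in",
    "zai-org/GLM-4.7": "opt_in",
    "zai-org/GLM-5": "opt_in",
}


def reasoning_extra_body(model: str) -> dict:
    """Return the extra_body fields that enable reasoning for `model` (hypothetical helper)."""
    mode = REASONING_MODELS.get(model)
    if mode is None:
        raise ValueError(f"{model} does not support reasoning")
    if mode == "opt_in":
        return {"chat_template_args": {"enable_thinking": True}}
    return {}  # reasoning is already on, so no extra fields are needed


print(reasoning_extra_body("zai-org/GLM-5"))  # {'chat_template_args': {'enable_thinking': True}}
print(reasoning_extra_body("openai/gpt-oss-120b"))  # {}
```

With the OpenAI Python client, pass the result as `extra_body=reasoning_extra_body(model)`.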

Enable thinking

Enable thinking for Kimi K2.5, Kimi K2.6, GLM 4.7, and GLM 5 by passing chat_template_args with enable_thinking set to true.

Python
JavaScript
cURL

Pass chat_template_args through extra_body, since it isn't part of the standard OpenAI API:

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
    max_tokens=4096,
    stream=True,
)

Include chat_template_args directly in the request options:

const response = await client.chat.completions.create({
    model: "moonshotai/Kimi-K2.5",
    messages: [{ role: "user", content: "What is the sum of the first 100 prime numbers?" }],
    chat_template_args: { enable_thinking: true },
    max_tokens: 4096,
    stream: true,
});

Include chat_template_args in the JSON request body:

curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "chat_template_args": {"enable_thinking": true},
    "max_tokens": 4096,
    "stream": false
  }'
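The Python and JavaScript requests above set stream to true, but they don't show how to read the streamed result. The sketch below collects reasoning and answer text separately in Python. It assumes that each streamed chunk's delta carries reasoning text in a `reasoning_content` attribute, mirroring the non-streaming field described under Parse the response, and that answer text arrives in `content`. The `collect_stream` and `fake_chunk` helpers are hypothetical. The demo uses fake chunks built with `SimpleNamespace` so it runs without network access.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, str]:
    """Split a streamed chat completion into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a trailing usage-only chunk
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is a provider extension, so it may be absent on a delta.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if delta.content:
            answer_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Build an object shaped like a streamed chunk, for the offline demo only."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning="2 + 3 "),
    fake_chunk(reasoning="= 5"),
    fake_chunk(content="The answer is 5."),
]
reasoning, answer = collect_stream(demo)
print("Reasoning:", reasoning)  # Reasoning: 2 + 3 = 5
print("Answer:", answer)  # Answer: The answer is 5.
```

With a real client, pass the `response` returned by the streaming Python request: `reasoning, answer = collect_stream(response)`.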

Control reasoning depth

The reasoning_effort parameter controls how thoroughly the model reasons through a problem. DeepSeek V4 Pro and GPT OSS 120B support this parameter.

Value	Behavior
`low`	Faster responses, less thorough reasoning
`medium`	Balanced (default)
`high`	Slower responses, more thorough reasoning
`xhigh`	Maximum reasoning depth, highest token cost (DeepSeek V4 Pro only)
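Not every value is valid for every model. A client-side check can catch mistakes, such as sending `xhigh` to GPT OSS 120B, before the request goes out. This is a minimal sketch. The `check_reasoning_effort` helper and the `ALLOWED_EFFORT` mapping are hypothetical, and the allowed values are copied from the table above rather than queried from the API.

```python
# Allowed reasoning_effort values per model, copied from the table above.
ALLOWED_EFFORT = {
    "deepseek-ai/DeepSeek-V4-Pro": ("low", "medium", "high", "xhigh"),
    "openai/gpt-oss-120b": ("low", "medium", "high"),
}


def check_reasoning_effort(model: str, effort: str = "medium") -> dict:
    """Return an extra_body dict with reasoning_effort, or raise if unsupported."""
    allowed = ALLOWED_EFFORT.get(model)
    if allowed is None:
        raise ValueError(f"{model} does not accept reasoning_effort")
    if effort not in allowed:
        raise ValueError(f"{effort!r} is not valid for {model}; choose one of {allowed}")
    return {"reasoning_effort": effort}


print(check_reasoning_effort("deepseek-ai/DeepSeek-V4-Pro", "xhigh"))  # {'reasoning_effort': 'xhigh'}
```

Defaulting to `medium` mirrors the documented default. It sends that value explicitly rather than omitting the field.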

DeepSeek V4 Pro
GPT OSS 120B

Python
JavaScript
cURL

Pass reasoning_effort through extra_body so it reaches the server unchanged:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"}  
)

print(response.choices[0].message.content)

Include reasoning_effort directly in the request options:

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "https://inference.baseten.co/v1",
    apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
    model: "deepseek-ai/DeepSeek-V4-Pro",
    messages: [
        { role: "user", content: "What is the sum of the first 100 prime numbers?" }
    ],
    reasoning_effort: "high"
});

console.log(response.choices[0].message.content);

Include reasoning_effort in the JSON request body:

curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'

Python
JavaScript
cURL

Pass reasoning_effort through extra_body so it reaches the server unchanged:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"}  
)

print(response.choices[0].message.content)

Include reasoning_effort directly in the request options:

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "https://inference.baseten.co/v1",
    apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
    model: "openai/gpt-oss-120b",
    messages: [
        { role: "user", content: "What is the sum of the first 100 prime numbers?" }
    ],
    reasoning_effort: "high"
});

console.log(response.choices[0].message.content);

Include reasoning_effort in the JSON request body:

curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'

Reasoning improves quality for tasks that benefit from step-by-step thinking: mathematical calculations, multi-step logic problems, code generation with complex requirements, and analysis requiring multiple considerations. For straightforward tasks like simple Q&A or text generation, reasoning adds latency and token cost without improving quality. In these cases, use a model without reasoning support, leave thinking disabled on an opt-in model, or set reasoning_effort to low.
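One way to apply this guidance is to choose the effort per request. The sketch below assumes your application already labels each request with a task category. The category names and the `effort_for_task` helper are hypothetical. Treat the mapping as a starting point to tune against your own latency and quality measurements.

```python
# Hypothetical task categories, following the guidance above.
REASONING_HEAVY = {"math", "multi_step_logic", "complex_code", "analysis"}
STRAIGHTFORWARD = {"simple_qa", "text_generation"}


def effort_for_task(task: str) -> str:
    """Pick a reasoning_effort value for a labeled task."""
    if task in REASONING_HEAVY:
        return "high"
    if task in STRAIGHTFORWARD:
        return "low"
    return "medium"  # fall back to the documented default


for task in ("math", "simple_qa", "translation"):
    print(task, "->", effort_for_task(task))
```

For DeepSeek V4 Pro or GPT OSS 120B, pass the result as `extra_body={"reasoning_effort": effort_for_task(task)}`.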

Parse the response

The model’s thinking process appears in reasoning_content, separate from the final answer in content. Both fields are returned on the message object.

Python
JavaScript
cURL

Read reasoning_content and content directly off the message object:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Is 91 a prime number? Answer in one sentence."}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
)

message = response.choices[0].message
print("Reasoning:", message.reasoning_content)
print("Answer:", message.content)

Read reasoning_content and content from the returned message:

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "https://inference.baseten.co/v1",
    apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
    model: "moonshotai/Kimi-K2.6",
    messages: [{ role: "user", content: "Is 91 a prime number? Answer in one sentence." }],
    chat_template_args: { enable_thinking: true },
});

const message = response.choices[0].message;
console.log("Reasoning:", message.reasoning_content);
console.log("Answer:", message.content);

Pipe the response through jq to extract each field:

curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.6",
    "messages": [{"role": "user", "content": "Is 91 a prime number? Answer in one sentence."}],
    "chat_template_args": {"enable_thinking": true}
  }' | jq '.choices[0].message | {reasoning: .reasoning_content, answer: .content}'

The response body contains both fields on the assistant message:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking whether 91 is a prime number... 91 = 7 × 13, so it is not prime...",
        "content": "No, 91 is not a prime number because it can be factored as $7 \\times 13$."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 203,
    "total_tokens": 224
  }
}
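If you call the endpoint without an SDK, you can read the same fields from the decoded JSON. The sketch below uses only the standard library and parses a trimmed copy of the sample body above. It reads both fields with `.get()`, so a missing `reasoning_content` yields `None` instead of raising `KeyError`. A missing field is assumed possible, for example when thinking is not enabled on an opt-in model. The `split_reasoning` helper is hypothetical.

```python
import json

# Trimmed copy of the sample response body above.
raw = """
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "91 = 7 x 13, so it is not prime.",
        "content": "No, 91 is not a prime number because it can be factored as 7 x 13."
      }
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 203, "total_tokens": 224}
}
"""


def split_reasoning(body: dict):
    """Return (reasoning, answer) from a chat completion response body."""
    message = body["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


reasoning, answer = split_reasoning(json.loads(raw))
print("Reasoning:", reasoning)
print("Answer:", answer)
```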

Reasoning tokens are included in completion_tokens and count toward your total usage and billing.
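Reasoning tokens are billed as completion tokens, so a long reasoning trace raises cost even when the final answer is short. The sketch below estimates the charge for the sample usage block above. The per-million-token prices are placeholder assumptions, not Baseten's rates; substitute the current pricing for your model. The `estimate_cost` helper is hypothetical.

```python
# Placeholder prices in USD per million tokens; NOT actual Baseten pricing.
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 2.00


def estimate_cost(usage: dict) -> float:
    """Estimate request cost; completion_tokens already includes reasoning tokens."""
    return (
        usage["prompt_tokens"] * INPUT_PRICE_PER_M
        + usage["completion_tokens"] * OUTPUT_PRICE_PER_M
    ) / 1_000_000


usage = {"prompt_tokens": 21, "completion_tokens": 203, "total_tokens": 224}
# total_tokens is the sum of prompt and completion tokens, reasoning included.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(f"${estimate_cost(usage):.7f}")
```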
