Example request (Python):
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(response.choices[0].message.content)
Example streaming response chunk (JSON):
{
  "model": "<string>",
  "choices": [
    {
      "index": 123,
      "delta": {
        "role": null,
        "content": null,
        "tool_calls": [
          {
            "index": 123,
            "function": {
              "arguments": "<string>",
              "name": null
            },
            "id": "<string>",
            "type": "function"
          }
        ]
      },
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": 123
    }
  ],
  "id": "<string>",
  "object": "chat.completion.chunk",
  "created": 123,
  "usage": null
}
Download the OpenAPI schema for code generation and client libraries.
Model APIs provide instant access to high-performance open-source LLMs through an OpenAI-compatible endpoint.

Replace OpenAI with Baseten

Switching from OpenAI to Baseten takes two changes: the base URL and the API key. With the Python SDK, set base_url and api_key when initializing the client:
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
Deploy a Model API to get started.
For detailed usage guides including structured outputs and tool calling, see Using Model APIs.

Authorizations

Authorization
string
header
required

Use Api-Key as the scheme in the Authorization header: Authorization: Api-Key YOUR_API_KEY.
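
For clients other than the OpenAI SDK, set the header yourself. A minimal sketch with the requests library (the /v1/chat/completions path is assumed from the OpenAI-compatible base URL used above):

import os
import requests

# The Authorization header uses the Api-Key scheme, not Bearer.
resp = requests.post(
    "https://inference.baseten.co/v1/chat/completions",
    headers={
        "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])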

Body

application/json

Request body for creating a chat completion.

messages
object[]
required

A list of messages representing the conversation history. Supports roles: system, user, assistant, and tool.

model
string
required

The model slug to use for completion, such as deepseek-ai/DeepSeek-V3.1. Find available models at Model APIs.

frequency_penalty
number
default:0

Penalizes tokens based on how frequently they appear in the text so far. Positive values decrease repetition. Support varies by model.

logit_bias
Logit Bias · object

A map of token IDs to bias values (-100 to 100). Use this to increase or decrease the likelihood of specific tokens appearing in the output.
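
A sketch of biasing against a specific token; the token ID here is a hypothetical placeholder, since IDs are tokenizer-specific:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    # Keys are token IDs, values range from -100 (effectively ban)
    # to 100 (effectively force). 1234 is a hypothetical token ID.
    logit_bias={"1234": -100},
)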

logprobs
boolean
default:false

If true, returns log probabilities of the output tokens. Log probability support varies by model.

top_logprobs
integer
default:0

Number of most likely tokens to return at each position (0-20). Requires logprobs: true. Log probability support varies by model.
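
A sketch requesting log probabilities, assuming the model supports them:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,   # required for top_logprobs
    top_logprobs=5,  # five most likely alternatives per position
)
first = response.choices[0].logprobs.content[0]
print(first.token, first.logprob, first.top_logprobs)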

max_tokens
integer
default:4096

Maximum number of tokens to generate. If your request input plus max_tokens exceeds the model's context length, max_tokens is truncated to fit. If your request exceeds the context length by more than 16k tokens, or if max_tokens expresses no preference, the context reservation is throttled to 49512 tokens. Higher max_tokens values slightly deprioritize request scheduling.

Required range: 1 <= x <= 262144
n
integer
default:1

Number of completions to generate. Only 1 is supported.

presence_penalty
number
default:0

Penalizes tokens based on whether they have appeared in the text so far. Positive values encourage the model to discuss new topics. Support varies by model.

response_format
ResponseFormatText · object

Plain text response format.

seed
integer

Random seed for deterministic generation. Determinism is not guaranteed across different hardware or model versions.

stop

Up to 32 sequences where the API stops generating further tokens. Can be a string or array of strings.

Required string length: 1 - 1000
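
For example, to stop at a blank line or a literal marker:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "List three colors."}],
    stop=["\n\n", "END"],  # halts at the first match; the stop text is excluded by default
)
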
stream
boolean
default:false

If true, responses are streamed back as server-sent events (SSE) as they are generated.

stream_options
StreamOptions · object

Options for streaming responses. Set include_usage: true to receive token usage statistics in the final chunk.
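
A minimal streaming sketch with the Python SDK; the final chunk carries usage only because include_usage is set, and typically arrives with an empty choices list:

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # only set on the final chunk
        print("\n", chunk.usage)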

temperature
number

Controls randomness in the output. Lower values like 0.2 produce more focused and deterministic responses. Higher values like 1.5 produce more creative and varied output.

Required range: 0 <= x <= 4
top_p
number
default:1

Nucleus sampling: only consider tokens with cumulative probability up to this value. Lower values like 0.1 produce more focused output.

Required range: x <= 1
tools
ChatCompletionToolsParam · object[]

A list of tools (functions) the model may call. Each tool should have a type: "function" and a function object with name, description, and parameters.

tool_choice

Controls which tool (if any) the model calls.

  • none: Never call a tool.
  • auto: Model decides whether to call a tool.
  • required: Model must call at least one tool.
  • {"type": "function", "function": {"name": "..."}}: Call a specific function.
Available options:
none,
auto,
required
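
A sketch of tool calling with a hypothetical get_weather function (see Using Model APIs for complete tool-calling guides):

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # the model decides whether to call the tool
)

# If the model chose to call the tool, the call appears on the message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
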
parallel_tool_calls
boolean
default:true

If true, the model can call multiple tools in a single response.

user
string

A unique identifier for the end-user, useful for tracking and abuse detection.

best_of
integer

Number of candidate sequences to generate and return the best from. Only a value of 1 is supported.

Required range: 1 <= x <= 1
top_k
integer
default:50

Limits token selection to the top K most probable tokens at each step. Lower values like 10 produce more focused output. Set to -1 to disable.

top_p_min
number
default:0

Minimum value for dynamic top_p. When set, top_p dynamically adjusts but does not go below this value.

min_p
number
default:0

Minimum probability threshold for token selection. Filters out tokens with probability below min_p * max_probability.

repetition_penalty
number
default:1

Multiplicative penalty for repeated tokens. Values greater than 1.0 discourage repetition, values less than 1.0 encourage it.
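
Parameters such as top_k, min_p, and repetition_penalty are not part of the OpenAI SDK's typed signature; with the Python SDK they can be passed through extra_body, which is forwarded verbatim in the request JSON. A sketch:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "top_k": 10,                # consider only the 10 most probable tokens
        "min_p": 0.05,              # drop tokens below 5% of the max probability
        "repetition_penalty": 1.1,  # mildly discourage repetition
    },
)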

length_penalty
number
default:1

Exponential penalty applied to sequence length during beam search. Values greater than 1.0 favor longer sequences.

early_stopping
boolean
default:false

If true, stops generation when at least n complete candidates are found.

bad

Words or phrases to avoid in the output. Support varies by model.

bad_token_ids
integer[]

Token IDs to avoid in the output. Support varies by model.

stop_token_ids
integer[]

List of token IDs that cause generation to stop when encountered.

include_stop_str_in_output
boolean
default:false

If true, includes the matched stop string in the output.

ignore_eos
boolean
default:false

If true, continues generating past the end-of-sequence token.

min_tokens
integer
default:0

Minimum number of tokens to generate before stopping. Useful for ensuring responses are not too short.

skip_special_tokens
boolean
default:true

If true, removes special tokens from the generated output.

spaces_between_special_tokens
boolean
default:true

If true, adds spaces between special tokens in the output.

truncate_prompt_tokens
integer

If set, truncates the prompt to this many tokens. Useful for handling inputs that may exceed context limits.

Required range: x >= 1
echo
boolean
default:false

If true and the last message role matches the generation role, prepends that message to the output.

add_generation_prompt
boolean
default:true

If true, adds the generation prompt from the chat template, such as <|assistant|>. Set to false for completion-style generation.

add_special_tokens
boolean
default:false

If true, adds special tokens like BOS to the prompt beyond what the chat template adds. For most models, the chat template handles special tokens, so this should be false.

documents
Documents · object[]

A list of documents for RAG (retrieval-augmented generation). Each document is a dict with string keys and values that the model can reference.
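
documents also sits outside the OpenAI SDK's typed parameters, so it can be passed through extra_body. A sketch with hypothetical document contents; how the documents are rendered depends on the model's chat template:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Summarize the report."}],
    extra_body={
        # Hypothetical documents; each is a dict of string keys and values
        # that the chat template can render for the model to reference.
        "documents": [
            {"title": "Q3 report", "text": "Revenue grew 12% quarter over quarter."},
        ],
    },
)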

chat_template
string

A custom Jinja template for formatting the conversation. If not provided, uses the model's default template.

chat_template_args
Chat Template Args · object

Additional arguments to pass to the chat template renderer.

disaggregated_params
DisaggregatedParams · object

Advanced parameters for disaggregated serving. Used internally for distributed inference.

Response

Successful response

A chat completion response returned by the model.

model
string
required

The model used for the completion.

choices
ChatCompletionResponseStreamChoice · object[]
required

A list of chat completion choices.

id
string

A unique identifier for the chat completion.

object
string
default:chat.completion.chunk

The object type: chat.completion for standard responses, or chat.completion.chunk for streaming.

Allowed value: "chat.completion.chunk"
created
integer

The Unix timestamp (in seconds) of when the completion was created.

usage
UsageInfo · object

Token usage statistics for the request. Only present when streaming with stream_options.include_usage: true.