Creates a chat completion for the provided conversation. This endpoint is fully compatible with the OpenAI Chat Completions API, allowing you to use standard OpenAI SDKs by changing only the base URL and API key.
When using an OpenAI SDK, set base_url and api_key when initializing the client. For direct HTTP requests, use Api-Key as the scheme in the Authorization header: Authorization: Api-Key YOUR_API_KEY.
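For example, a minimal client setup with the official OpenAI Python SDK; the base URL below is a placeholder for your deployment's endpoint:

```python
from openai import OpenAI

# Placeholder base URL and key; substitute your deployment's values.
# For raw HTTP requests instead, send: Authorization: Api-Key YOUR_API_KEY
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)
```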
Request body for creating a chat completion.
A list of messages representing the conversation history. Supports roles: system, user, assistant, and tool.
The model slug to use for completion, such as deepseek-ai/DeepSeek-V3.1. Find available models at Model APIs.
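As a sketch, a minimal request combining the fields above (the client comes from the setup example earlier; the messages are illustrative):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```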
Penalizes tokens based on how frequently they appear in the text so far. Positive values decrease repetition. Support varies by model.
A map of token IDs to bias values (-100 to 100). Use this to increase or decrease the likelihood of specific tokens appearing in the output.
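A sketch of token biasing; the token IDs below are placeholders, since real IDs depend on the model's tokenizer:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Name a color."}],
    logit_bias={
        "1234": 50,    # placeholder ID: make this token much more likely
        "5678": -100,  # placeholder ID: effectively ban this token
    },
)
```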
If true, returns log probabilities of the output tokens. Log probability support varies by model.
Number of most likely tokens to return at each position (0-20). Requires logprobs: true. Log probability support varies by model.
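For instance, requesting per-token log probabilities (assuming the model supports them):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Is water wet? Answer in one word."}],
    logprobs=True,
    top_logprobs=5,  # 0-20; requires logprobs=True
)
# Each generated token carries its log probability plus the 5 closest alternatives.
for token_info in completion.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
```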
Maximum number of tokens to generate. If your request input plus max_tokens exceeds the model's context length, max_tokens is truncated to fit. If the request exceeds the context length by more than 16k tokens, or if max_tokens is unset (signaling no preference), the context reservation is throttled to 49512 tokens. Higher max_tokens values slightly deprioritize request scheduling. Constraints: 1 <= x <= 262144.
Number of completions to generate. Only 1 is supported.
Penalizes tokens based on whether they have appeared in the text so far. Positive values encourage the model to discuss new topics. Support varies by model.
Plain text response format.
Random seed for deterministic generation. Determinism is not guaranteed across different hardware or model versions.
Up to 32 sequences where the API stops generating further tokens. Can be a string or an array of strings. Constraints: 1 - 1000.
If true, responses are streamed back as server-sent events (SSE) as they are generated.
Options for streaming responses. Set include_usage: true to receive token usage statistics in the final chunk.
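A streaming sketch with usage reporting enabled; when include_usage is true, the final chunk carries empty choices and the usage object:

```python
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Write a haiku about rivers."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # only set on the final chunk when include_usage is true
        print("\nTotal tokens:", chunk.usage.total_tokens)
```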
Controls randomness in the output. Lower values like 0.2 produce more focused and deterministic responses; higher values like 1.5 produce more creative and varied output. Constraints: 0 <= x <= 4.
Nucleus sampling: only consider tokens with cumulative probability up to this value. Lower values like 0.1 produce more focused output. Constraints: x <= 1.
A list of tools (functions) the model may call. Each tool should have a type: "function" and a function object with name, description, and parameters.
Controls which tool (if any) the model calls.
- none: Never call a tool.
- auto: The model decides whether to call a tool.
- required: The model must call at least one tool.
- {"type": "function", "function": {"name": "..."}}: Call a specific function.
If true, the model can call multiple tools in a single response.
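A tool-calling sketch; get_weather is a hypothetical function defined only for illustration:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",        # let the model decide
    parallel_tool_calls=True,  # allow multiple tool calls in one response
)
for call in completion.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```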
A unique identifier for the end-user, useful for tracking and abuse detection.
Number of candidate sequences to generate and return the best from. Only a value of 1 is supported. Constraints: 1 <= x <= 1.
Limits token selection to the top K most probable tokens at each step. Lower values like 10 produce more focused output. Set to -1 to disable.
Minimum value for dynamic top_p. When set, top_p dynamically adjusts but does not go below this value.
Minimum probability threshold for token selection. Filters out tokens with probability below min_p * max_probability.
Multiplicative penalty for repeated tokens. Values greater than 1.0 discourage repetition, values less than 1.0 encourage it.
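These sampling controls are not part of the standard OpenAI SDK signature; one way to send them from the Python SDK is its extra_body passthrough, sketched here with illustrative values:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Brainstorm three product names."}],
    extra_body={
        "top_k": 40,                # consider only the 40 most probable tokens
        "min_p": 0.05,              # drop tokens below 5% of the top probability
        "repetition_penalty": 1.1,  # mildly discourage repeats
    },
)
```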
Exponential penalty applied to sequence length during beam search. Values greater than 1.0 favor longer sequences.
If true, stops generation when at least n complete candidates are found.
Words or phrases to avoid in the output. Support varies by model.
Token IDs to avoid in the output. Support varies by model.
List of token IDs that cause generation to stop when encountered.
If true, includes the matched stop string in the output.
If true, continues generating past the end-of-sequence token.
Minimum number of tokens to generate before stopping. Useful for ensuring responses are not too short.
If true, removes special tokens from the generated output.
If true, adds spaces between special tokens in the output.
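Likewise, these output controls can be forwarded via extra_body; the stop token ID below is a placeholder, since real IDs are tokenizer-specific:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Explain BOS tokens briefly."}],
    extra_body={
        "min_tokens": 32,                    # avoid overly short answers
        "stop_token_ids": [128001],          # placeholder token ID
        "include_stop_str_in_output": False,
        "skip_special_tokens": True,
    },
)
```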
If set, truncates the prompt to this many tokens. Useful for handling inputs that may exceed context limits. Constraints: x >= 1.
If true and the last message's role matches the generation role, prepends that message to the output.
If true, adds the generation prompt from the chat template, such as <|assistant|>. Set to false for completion-style generation.
If true, adds special tokens like BOS to the prompt beyond what the chat template adds. For most models, the chat template handles special tokens, so this should be false.
A list of documents for RAG (retrieval-augmented generation). Each document is a dict with string keys and values that the model can reference.
A custom Jinja template for formatting the conversation. If not provided, uses the model's default template.
Additional arguments to pass to the chat template renderer.
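As a sketch, template arguments can also be forwarded through extra_body; the enable_thinking flag is illustrative and only meaningful if the model's chat template defines such a variable:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        # Illustrative: only has an effect if the model's template accepts it.
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
```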
Advanced parameters for disaggregated serving. Used internally for distributed inference.
Successful response
A chat completion response returned by the model.
The model used for the completion.
A list of chat completion choices.
A unique identifier for the chat completion.
The object type, always chat.completion or chat.completion.chunk for streaming.
"chat.completion.chunk"The Unix timestamp (in seconds) of when the completion was created.
Token usage statistics for the request. Only present when streaming with stream_options.include_usage: true.
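Putting the response fields together, a non-streaming completion can be unpacked like this (attribute access follows the OpenAI SDK's object shape):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(completion.id)                          # unique identifier
print(completion.object)                      # "chat.completion"
print(completion.created)                     # Unix timestamp in seconds
print(completion.model)                       # model used for the completion
print(completion.choices[0].message.content)  # first choice's text
if completion.usage:                          # may be absent depending on the request
    print(completion.usage.prompt_tokens, completion.usage.completion_tokens)
```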