Creates a chat completion for the provided conversation. This endpoint is fully compatible with the OpenAI Chat Completions API, allowing you to use standard OpenAI SDKs by changing only the base URL and API key.
When using an OpenAI SDK, set base_url and api_key when initializing the client. For direct HTTP requests, use Api-Key as the scheme in the Authorization header: Authorization: Api-Key YOUR_API_KEY.
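For example, a minimal client setup with the official OpenAI Python SDK; the base URL below is a placeholder for your deployment's endpoint:

```python
from openai import OpenAI

# Placeholder base URL and key; substitute your deployment's values.
# For raw HTTP requests instead, send: Authorization: Api-Key YOUR_API_KEY
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)
```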
Request body for creating a chat completion.
A list of messages representing the conversation history. Supports roles: system, user, assistant, and tool.
The model slug to use for completion, such as deepseek-ai/DeepSeek-V3.1. Find available models at Model APIs.
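As a sketch, a minimal request combining the fields above (the client comes from the setup example earlier; the messages are illustrative):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```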
Penalizes tokens based on how frequently they appear in the text so far. Positive values decrease repetition. Support varies by model.
A map of token IDs to bias values (-100 to 100). Use this to increase or decrease the likelihood of specific tokens appearing in the output.
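A sketch of token biasing; the token IDs below are placeholders, since real IDs depend on the model's tokenizer:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Name a color."}],
    logit_bias={
        "1234": 50,    # placeholder ID: make this token much more likely
        "5678": -100,  # placeholder ID: effectively ban this token
    },
)
```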
If true, returns log probabilities of the output tokens. Log probability support varies by model.
Number of most likely tokens to return at each position (0-20). Requires logprobs: true. Log probability support varies by model.
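For instance, requesting per-token log probabilities (assuming the model supports them):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Is water wet? Answer in one word."}],
    logprobs=True,
    top_logprobs=5,  # 0-20; requires logprobs=True
)
# Each generated token carries its log probability plus the 5 closest alternatives.
for token_info in completion.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
```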
Maximum number of tokens to generate. If your request input plus max_tokens exceeds the model's context length, max_tokens is truncated to fit. If the request exceeds the context length by more than 16k tokens, or if max_tokens is unset (signaling no preference), the context reservation is throttled to 49512 tokens. Higher max_tokens values slightly deprioritize request scheduling. Constraints: 1 <= x <= 262144.
Number of completions to generate. Only 1 is supported.
Penalizes tokens based on whether they have appeared in the text so far. Positive values encourage the model to discuss new topics. Support varies by model.
Plain text response format.
Random seed for deterministic generation. Determinism is not guaranteed across different hardware or model versions.
Up to 32 sequences where the API stops generating further tokens. Can be a string or an array of strings. Constraints: 1 - 1000.
If true, responses are streamed back as server-sent events (SSE) as they are generated.
Options for streaming responses. Set include_usage: true to receive token usage statistics in the final chunk.
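A streaming sketch with usage reporting enabled; when include_usage is true, the final chunk carries empty choices and the usage object:

```python
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Write a haiku about rivers."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # only set on the final chunk when include_usage is true
        print("\nTotal tokens:", chunk.usage.total_tokens)
```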
Controls randomness in the output. Lower values like 0.2 produce more focused and deterministic responses; higher values like 1.5 produce more creative and varied output. Constraints: 0 <= x <= 4.
Nucleus sampling: only consider tokens with cumulative probability up to this value. Lower values like 0.1 produce more focused output. Constraints: x <= 1.
A list of tools (functions) the model may call. Each tool should have a type: "function" and a function object with name, description, and parameters.
Controls which tool (if any) the model calls.
- none: Never call a tool.
- auto: The model decides whether to call a tool.
- required: The model must call at least one tool.
- {"type": "function", "function": {"name": "..."}}: Call a specific function.
If true, the model can call multiple tools in a single response.
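A tool-calling sketch; get_weather is a hypothetical function defined only for illustration:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",        # let the model decide
    parallel_tool_calls=True,  # allow multiple tool calls in one response
)
for call in completion.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```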
A unique identifier for the end-user, useful for tracking and abuse detection.
Number of candidate sequences to generate and return the best from. Only a value of 1 is supported. Constraints: 1 <= x <= 1.
Limits token selection to the top K most probable tokens at each step. Lower values like 10 produce more focused output. Set to -1 to disable.
Minimum value for dynamic top_p. When set, top_p dynamically adjusts but does not go below this value.
Minimum probability threshold for token selection. Filters out tokens with probability below min_p * max_probability.
Multiplicative penalty for repeated tokens. Values greater than 1.0 discourage repetition, values less than 1.0 encourage it.
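These sampling controls are not part of the standard OpenAI SDK signature; one way to send them from the Python SDK is its extra_body passthrough, sketched here with illustrative values:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Brainstorm three product names."}],
    extra_body={
        "top_k": 40,                # consider only the 40 most probable tokens
        "min_p": 0.05,              # drop tokens below 5% of the top probability
        "repetition_penalty": 1.1,  # mildly discourage repeats
    },
)
```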
Exponential penalty applied to sequence length during beam search. Values greater than 1.0 favor longer sequences.
If true, stops generation when at least n complete candidates are found.
Words or phrases to avoid in the output. Support varies by model.
Token IDs to avoid in the output. Support varies by model.
List of token IDs that cause generation to stop when encountered.
If true, includes the matched stop string in the output.
If true, continues generating past the end-of-sequence token.
Minimum number of tokens to generate before stopping. Useful for ensuring responses are not too short.
If true, removes special tokens from the generated output.
If true, adds spaces between special tokens in the output.
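Likewise, these output controls can be forwarded via extra_body; the stop token ID below is a placeholder, since real IDs are tokenizer-specific:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Explain BOS tokens briefly."}],
    extra_body={
        "min_tokens": 32,                    # avoid overly short answers
        "stop_token_ids": [128001],          # placeholder token ID
        "include_stop_str_in_output": False,
        "skip_special_tokens": True,
    },
)
```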
If set, truncates the prompt to this many tokens. Useful for handling inputs that may exceed context limits. Constraints: x >= 1.
If true and the last message's role matches the generation role, prepends that message to the output.
If true, adds the generation prompt from the chat template, such as <|assistant|>. Set to false for completion-style generation.
If true, adds special tokens like BOS to the prompt beyond what the chat template adds. For most models, the chat template handles special tokens, so this should be false.
A list of documents for RAG (retrieval-augmented generation). Each document is a dict with string keys and values that the model can reference.
A custom Jinja template for formatting the conversation. If not provided, uses the model's default template.
Additional arguments to pass to the chat template renderer.
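As a sketch, template arguments can also be forwarded through extra_body; the enable_thinking flag is illustrative and only meaningful if the model's chat template defines such a variable:

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        # Illustrative: only has an effect if the model's template accepts it.
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
```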
Advanced parameters for disaggregated serving. Used internally for distributed inference.
Successful response
A chat completion response returned by the model.
The model used for the completion.
A list of chat completion choices.
A unique identifier for the chat completion.
The object type, always chat.completion or chat.completion.chunk for streaming.
"chat.completion.chunk"The Unix timestamp (in seconds) of when the completion was created.
Token usage statistics for the request. Only present when streaming with stream_options.include_usage: true.
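Putting the response fields together, a non-streaming completion can be unpacked like this (attribute access follows the OpenAI SDK's object shape):

```python
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(completion.id)                          # unique identifier
print(completion.object)                      # "chat.completion"
print(completion.created)                     # Unix timestamp in seconds
print(completion.model)                       # model used for the completion
print(completion.choices[0].message.content)  # first choice's text
if completion.usage:                          # may be absent depending on the request
    print(completion.usage.prompt_tokens, completion.usage.completion_tokens)
```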