Model APIs provide instant access to high-performance LLMs through OpenAI-compatible endpoints. Point your existing OpenAI SDK at Baseten’s inference endpoint and start making calls, no model deployment required. Unlike self-deployed models, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn’t in the supported list, or want dedicated GPUs with custom scaling, deploy your own with Truss.

Supported models

Enable a model from the Model APIs page in the Baseten dashboard.
| Model | Slug | Context |
|---|---|---|
| DeepSeek V3 0324 | deepseek-ai/DeepSeek-V3-0324 | 164k |
| DeepSeek V3.1 | deepseek-ai/DeepSeek-V3.1 | 164k |
| GLM 4.6 | zai-org/GLM-4.6 | 200k |
| GLM 4.7 | zai-org/GLM-4.7 | 204k |
| GLM 5 | zai-org/GLM-5 | 327k |
| Kimi K2 0905 | moonshotai/Kimi-K2-Instruct-0905 | 128k |
| Kimi K2 Thinking | moonshotai/Kimi-K2-Thinking | 262k |
| Kimi K2.5 | moonshotai/Kimi-K2.5 | 262k |
| MiniMax M2.5 | MiniMaxAI/MiniMax-M2.5 | 204k |
| OpenAI GPT OSS 120B | openai/gpt-oss-120b | 128k |

Create a chat completion

If you’ve already completed the quickstart, you have a working client. The examples below show a multi-turn conversation with a system message, which you can adapt for your application.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
Replace the model slug with any slug from the supported models table.

Features

Model APIs support the full OpenAI Chat Completions API. You can generate structured outputs that conform to a JSON schema, use tool calling to let the model invoke functions you define, and enable reasoning for extended thinking on complex tasks. Set stream: true to receive responses as server-sent events. For the complete parameter reference, see the Chat Completions API documentation.

Migrate from OpenAI

To migrate existing OpenAI code to Baseten, change three values:
  1. Replace your API key with a Baseten API key.
  2. Change the base URL to https://inference.baseten.co/v1.
  3. Update the model name to a Baseten model slug.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",   # 2. new base URL
    api_key=os.environ["BASETEN_API_KEY"],        # 1. Baseten API key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",            # 3. Baseten model slug
    messages=[{"role": "user", "content": "Hello"}],
)

Handle errors

Model APIs return standard HTTP error codes:
| Code | Meaning |
|---|---|
| 400 | Invalid request (check your parameters) |
| 401 | Invalid or missing API key |
| 402 | Payment required |
| 404 | Model not found |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
The response body contains details about the error and suggested resolutions.

Next steps