Model APIs provide instant access to high-performance LLMs through OpenAI-compatible endpoints. Point your existing OpenAI SDK at Baseten’s inference endpoint and start making calls; no model deployment is required. Unlike self-deployed models, where you’d configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn’t in the supported list, or want dedicated GPUs with custom scaling, deploy your own with Truss and call it through the predict API.

Supported models

Enable a model from the Model APIs page in the Baseten dashboard.

Pricing

Pricing is per million tokens. Every request participates in caching automatically; there are no flags to set. When a request’s prefix matches a previously cached prefix, those tokens are billed at the cached input rate; all other input tokens are billed at the uncached input rate.
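The split between cached and uncached rates makes a request's input cost a simple weighted sum. A small helper makes the arithmetic concrete — the rates below are made up for illustration; check the pricing table for real numbers:

```python
def input_token_cost(cached_tokens: int, uncached_tokens: int,
                     cached_rate: float, uncached_rate: float) -> float:
    """Dollar cost of one request's input tokens.

    Rates are in dollars per million tokens. The cached/uncached split is
    reported per request by the API; it is not something the caller sets.
    """
    return (cached_tokens * cached_rate + uncached_tokens * uncached_rate) / 1_000_000

# Hypothetical rates: $0.10/M cached, $1.00/M uncached.
# 900k cached + 100k uncached tokens -> $0.09 + $0.10 = $0.19
cost = input_token_cost(900_000, 100_000, cached_rate=0.10, uncached_rate=1.00)
```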

Feature support

All models support tool calling. Support for other features varies by model; see Reasoning for configuration details.
GLM models and Nemotron Super also support the top_p and top_k sampling parameters.

Create a chat completion

If you’ve already completed the quickstart, you have a working client. The examples below show a multi-turn conversation with a system message that you can adapt for your application.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
Replace the model slug with any slug from the supported models table.

Compatible features

Model APIs follow the OpenAI Chat Completions API, so you can use structured outputs, tool calling, reasoning, vision, and streaming (stream: true) with the same parameters you’d use with OpenAI. Check the feature support table for per-model availability. For the complete parameter reference, see the Chat Completions API documentation.

List available models

Query the /v1/models endpoint for the current list of models with metadata including pricing, context lengths, and supported features.
curl https://inference.baseten.co/v1/models \
  -H "Authorization: Api-Key $BASETEN_API_KEY"
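The same query from Python using only the standard library — it mirrors the curl command above, and the network call only runs when BASETEN_API_KEY is set (the field names in the loop assume an OpenAI-style list response):

```python
import json
import os
import urllib.request

# GET /v1/models with the same Api-Key authorization header as the curl example.
req = urllib.request.Request(
    "https://inference.baseten.co/v1/models",
    headers={"Authorization": f"Api-Key {os.environ.get('BASETEN_API_KEY', '')}"},
)

if os.environ.get("BASETEN_API_KEY"):  # skip the network call without a key
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    for model in body.get("data", []):
        print(model["id"])
```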

Migrate from OpenAI

To migrate existing OpenAI code to Baseten, change three values:
  1. Replace your API key with a Baseten API key.
  2. Change the base URL to https://inference.baseten.co/v1.
  3. Update the model name to a Baseten model slug.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",  # 2. Baseten base URL
    api_key=os.environ["BASETEN_API_KEY"],  # 1. Baseten API key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",  # 3. Baseten model slug
    messages=[{"role": "user", "content": "Hello"}],
)

Handle errors

Model APIs return standard HTTP error codes:
Code  Meaning
400   Invalid request (check your parameters)
401   Invalid or missing API key
402   Payment required
404   Model not found
429   Rate limit exceeded
500   Internal server error
Each error response includes a JSON body with details about the issue and suggested resolutions.
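Of these, 429 and 500 are usually transient and worth retrying with exponential backoff, while 400/401/402/404 indicate a problem with the request itself. A minimal retry sketch, assuming the raised exception carries the HTTP status as a status_code attribute (the OpenAI SDK’s APIStatusError does; adapt the attribute lookup for other clients):

```python
import time

RETRYABLE_STATUS = {429, 500}

def with_retries(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Run call(), retrying on 429/500 with capped exponential backoff.

    Assumes failures raise an exception exposing a `status_code` attribute;
    non-retryable errors (400, 401, 402, 404) are re-raised immediately.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE_STATUS or attempt == max_retries - 1:
                raise
            # Delay doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            time.sleep(min(max_delay, base_delay * 2 ** attempt))
```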

Next steps

Reasoning

Control extended thinking for complex tasks

Vision

Send images and videos alongside text

Rate limits

Understand and configure rate limits

API reference

Complete parameter documentation