Model APIs provide instant access to high-performance LLMs through endpoints that are compatible with both the OpenAI Chat Completions API and the Anthropic Messages API. Point your existing OpenAI or Anthropic SDK at Baseten's inference endpoint and start making calls; no model deployment is required.

Unlike self-deployed models, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages: you get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own model with Truss and call it through the predict API.

Supported models

Enable a model from the Model APIs page in the Baseten dashboard.

Pricing

Pricing is per million tokens. Every request participates in caching automatically — there are no flags to set. When a request’s prefix matches a previously-cached prefix, those tokens are billed at the cache input rate; all other input tokens are billed at the uncached input rate.
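As a sketch of how this billing model works, the helper below splits input tokens into cached and uncached buckets. The rates are placeholders for illustration, not Baseten's actual prices; check the pricing table for real numbers.

```python
def estimate_cost(prompt_tokens, cached_tokens, output_tokens,
                  input_rate, cached_rate, output_rate):
    """Estimate a request's cost in USD; all rates are per million tokens."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000

# Placeholder rates, not Baseten's actual pricing:
cost = estimate_cost(prompt_tokens=12_000, cached_tokens=10_000, output_tokens=500,
                     input_rate=0.50, cached_rate=0.05, output_rate=1.50)
print(f"${cost:.6f}")
```

A long shared system prompt is the typical source of cache hits: on repeat requests, most of the prompt is billed at the lower cached rate.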

Feature support

All models support tool calling. Support for other features varies by model. See Reasoning for configuration details.
GLM models and Nemotron Super also support the top_p and top_k sampling parameters.
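Since top_p is part of the standard Chat Completions schema but top_k is not, the OpenAI SDK has to send top_k through extra_body, which merges extra keys into the request JSON. A small helper (hypothetical, not part of any SDK) keeps that split straight:

```python
def sampling_kwargs(top_p=None, top_k=None):
    """Split sampling parameters into standard kwargs and extra_body entries.

    top_p is a native Chat Completions parameter; top_k is not, so the
    OpenAI SDK must pass it via extra_body to reach the request payload.
    """
    kwargs = {}
    if top_p is not None:
        kwargs["top_p"] = top_p
    if top_k is not None:
        kwargs["extra_body"] = {"top_k": top_k}
    return kwargs
```

Spread the result into the call: `client.chat.completions.create(model=..., messages=..., **sampling_kwargs(top_p=0.9, top_k=40))`.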

Create a chat completion

If you’ve already completed the quickstart, you have a working client. Use the OpenAI SDK or the Anthropic SDK against any supported model.

Use the OpenAI SDK

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
Replace the model slug with any model from the supported models table.

Use the Anthropic SDK

You can also call supported models using the Anthropic Messages API at https://inference.baseten.co/v1/messages.
import anthropic
import os

API_KEY = os.environ["BASETEN_API_KEY"]

client = anthropic.Anthropic(
    base_url="https://inference.baseten.co",
    api_key=API_KEY,
    default_headers={"Authorization": f"Bearer {API_KEY}"},
)

response = client.messages.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is gradient descent?"}
    ],
)

print(response.content[0].text)
The Anthropic SDK sends the API key as x-api-key by default. Baseten reads Authorization, so override default_headers as shown.

Compatible features

Model APIs follow the OpenAI Chat Completions API, so you can use structured outputs, tool calling, reasoning, vision, and streaming (stream: true) with the same parameters you’d use with OpenAI. Check the feature support table for per-model availability. For the complete parameter reference, see the Chat Completions API documentation.

List available models

Query the /v1/models endpoint for the current list of models with metadata including pricing, context lengths, and supported features.
curl https://inference.baseten.co/v1/models \
  -H "Authorization: Bearer $BASETEN_API_KEY"
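The response follows the OpenAI-style list shape ({"object": "list", "data": [...]}), with per-model metadata riding along on each entry. Assuming that shape, a minimal helper to pull out the model slugs:

```python
def model_ids(models_response):
    """Extract sorted model slugs from an OpenAI-style /v1/models response.

    Assumes the list shape {"object": "list", "data": [{"id": ...}, ...]};
    extra metadata fields on each entry are ignored here.
    """
    return sorted(entry["id"] for entry in models_response["data"])
```

Feed it the parsed JSON body, e.g. `model_ids(client.models.list().model_dump())` with the OpenAI SDK, or the decoded curl output.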

Migrate

To migrate to Baseten, change the base URL, API key, and model name.
  1. Replace your OpenAI API key with a Baseten API key.
  2. Change the base URL to https://inference.baseten.co/v1.
  3. Update the model name to a Baseten model slug.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",  # 2. Baseten base URL
    api_key=os.environ["BASETEN_API_KEY"],       # 1. Baseten API key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # 3. Baseten model slug
    messages=[{"role": "user", "content": "Hello"}],
)

Handle errors

Model APIs return standard HTTP error codes:
Code  Meaning
400   Invalid request (check your parameters)
401   Invalid or missing API key
402   Payment required
404   Model not found
429   Rate limit exceeded
500   Internal server error
Each error response includes a JSON body with details about the issue and suggested resolutions.
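429s and transient 5xx errors are usually worth retrying with backoff. The wrapper below is an illustrative sketch, not a Baseten utility; it assumes the raised exception carries a status_code attribute, as the OpenAI SDK's APIStatusError does:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a zero-argument request function, retrying 429s and 5xx errors.

    Uses exponential backoff (base_delay doubling per attempt). Any other
    status, or exhausting max_attempts, re-raises the original exception.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            retryable = status == 429 or (status is not None and status >= 500)
            if not retryable or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example: `with_retries(lambda: client.chat.completions.create(model=..., messages=...))`. A 401 or 404 fails immediately, since retrying cannot fix a bad key or an unknown model slug.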

Next steps

  - Reasoning: Control extended thinking for complex tasks
  - Vision: Send images and videos alongside text
  - Rate limits: Understand and configure rate limits
  - API reference: Complete parameter documentation