Model APIs provide instant access to high-performance LLMs through endpoints that are compatible with both the OpenAI Chat Completions API and the Anthropic Messages API. Point your existing OpenAI or Anthropic SDK at Baseten's inference endpoint and start making calls; no model deployment is required.

Unlike self-deployed models, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages: you get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own model with Truss and call it through the predict API.

Supported models

Enable a model from the Model APIs page in the Baseten dashboard.

Pricing

Pricing is per million tokens. Every request participates in caching automatically — there are no flags to set. When a request’s prefix matches a previously-cached prefix, those tokens are billed at the cache input rate; all other input tokens are billed at the uncached input rate.
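As a sketch of how this billing model works, the helper below splits input tokens into cached and uncached buckets. The rates are placeholders for illustration, not Baseten's actual prices; check the pricing table for real numbers.

```python
def estimate_cost(prompt_tokens, cached_tokens, output_tokens,
                  input_rate, cached_rate, output_rate):
    """Estimate a request's cost in USD; all rates are per million tokens."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000

# Placeholder rates, not Baseten's actual pricing:
cost = estimate_cost(prompt_tokens=12_000, cached_tokens=10_000, output_tokens=500,
                     input_rate=0.50, cached_rate=0.05, output_rate=1.50)
print(f"${cost:.6f}")
```

A long shared system prompt is the typical source of cache hits: on repeat requests, most of the prompt is billed at the lower cached rate.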

Feature support

All models support tool calling. Support for other features varies by model. See Reasoning for configuration details.
GLM models and Nemotron Super also support the top_p and top_k sampling parameters.
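Since top_p is part of the standard Chat Completions schema but top_k is not, the OpenAI SDK has to send top_k through extra_body, which merges extra keys into the request JSON. A small helper (hypothetical, not part of any SDK) keeps that split straight:

```python
def sampling_kwargs(top_p=None, top_k=None):
    """Split sampling parameters into standard kwargs and extra_body entries.

    top_p is a native Chat Completions parameter; top_k is not, so the
    OpenAI SDK must pass it via extra_body to reach the request payload.
    """
    kwargs = {}
    if top_p is not None:
        kwargs["top_p"] = top_p
    if top_k is not None:
        kwargs["extra_body"] = {"top_k": top_k}
    return kwargs
```

Spread the result into the call: `client.chat.completions.create(model=..., messages=..., **sampling_kwargs(top_p=0.9, top_k=40))`.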

Create a chat completion

If you’ve already completed the quickstart, you have a working client. Use the OpenAI SDK or the Anthropic SDK against any supported model.

Use the OpenAI SDK

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
Replace the model slug with any model from the supported models table.

Use the Anthropic SDK

You can also call supported models using the Anthropic Messages API at https://inference.baseten.co/v1/messages.
import anthropic
import os

API_KEY = os.environ["BASETEN_API_KEY"]

client = anthropic.Anthropic(
    base_url="https://inference.baseten.co",
    api_key=API_KEY,
    default_headers={"Authorization": f"Bearer {API_KEY}"},
)

response = client.messages.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is gradient descent?"}
    ],
)

print(response.content[0].text)
The Anthropic SDK sends the API key as x-api-key by default. Baseten reads Authorization, so override default_headers as shown.

Compatible features

Model APIs follow the OpenAI Chat Completions API, so you can use structured outputs, tool calling, reasoning, vision, and streaming (stream: true) with the same parameters you’d use with OpenAI. Check the feature support table for per-model availability. For the complete parameter reference, see the Chat Completions API documentation.

List available models

Query the /v1/models endpoint for the current list of models with metadata including pricing, context lengths, and supported features.
curl https://inference.baseten.co/v1/models \
  -H "Authorization: Bearer $BASETEN_API_KEY"
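The response follows the OpenAI-style list shape ({"object": "list", "data": [...]}), with per-model metadata riding along on each entry. Assuming that shape, a minimal helper to pull out the model slugs:

```python
def model_ids(models_response):
    """Extract sorted model slugs from an OpenAI-style /v1/models response.

    Assumes the list shape {"object": "list", "data": [{"id": ...}, ...]};
    extra metadata fields on each entry are ignored here.
    """
    return sorted(entry["id"] for entry in models_response["data"])
```

Feed it the parsed JSON body, e.g. `model_ids(client.models.list().model_dump())` with the OpenAI SDK, or the decoded curl output.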

Migrate

To migrate to Baseten, change the base URL, API key, and model name.
  1. Replace your OpenAI API key with a Baseten API key.
  2. Change the base URL to https://inference.baseten.co/v1.
  3. Update the model name to a Baseten model slug.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",  # 2. Baseten base URL
    api_key=os.environ["BASETEN_API_KEY"],       # 1. Baseten API key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # 3. Baseten model slug
    messages=[{"role": "user", "content": "Hello"}],
)

Handle errors

Model APIs return standard HTTP error codes:
Code  Meaning
400   Invalid request (check your parameters)
401   Invalid or missing API key
402   Payment required
404   Model not found
429   Rate limit exceeded
500   Internal server error
Each error response includes a JSON body with details about the issue and suggested resolutions.
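429s and transient 5xx errors are usually worth retrying with backoff. The wrapper below is an illustrative sketch, not a Baseten utility; it assumes the raised exception carries a status_code attribute, as the OpenAI SDK's APIStatusError does:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a zero-argument request function, retrying 429s and 5xx errors.

    Uses exponential backoff (base_delay doubling per attempt). Any other
    status, or exhausting max_attempts, re-raises the original exception.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            retryable = status == 429 or (status is not None and status >= 500)
            if not retryable or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example: `with_retries(lambda: client.chat.completions.create(model=..., messages=...))`. A 401 or 404 fails immediately, since retrying cannot fix a bad key or an unknown model slug.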

Next steps

  - Reasoning: Control extended thinking for complex tasks
  - Vision: Send images and videos alongside text
  - Rate limits: Understand and configure rate limits
  - API reference: Complete parameter documentation