
Model APIs provide instant access to high-performance LLMs through endpoints compatible with both the OpenAI Chat Completions API and the Anthropic Messages API. Point your existing OpenAI or Anthropic SDK at Baseten's inference endpoint and start making calls; no model deployment is required. Unlike dedicated deployments, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own model with Truss.

Supported models

Run inference against any Model API to get started.

Pricing

Model APIs bill per million tokens. For current per-model rates, see the Model APIs pricing page. Cached input tokens are prompt tokens served from the KV cache, billed at a discounted rate. Every request participates in caching automatically, with no flags or opt-in steps.
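To illustrate how per-million-token billing with a cached-token discount works, the sketch below estimates a request's cost from its token counts. The rates and the discount used here are placeholder values, not Baseten's actual pricing; see the Model APIs pricing page for real per-model rates.

```python
# Hypothetical per-million-token rates, for illustration only.
# Real rates are listed on the Model APIs pricing page.
INPUT_RATE = 0.50          # $ per 1M uncached input tokens (placeholder)
CACHED_INPUT_RATE = 0.05   # $ per 1M cached input tokens (placeholder)
OUTPUT_RATE = 1.50         # $ per 1M output tokens (placeholder)

def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in dollars.

    Cached tokens are the subset of input tokens served from the KV cache,
    so they are billed at the discounted rate instead of the full input rate.
    """
    uncached = input_tokens - cached_tokens
    return (
        uncached * INPUT_RATE
        + cached_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000

print(round(estimate_cost(10_000, 8_000, 500), 6))
```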

Feature support

All models support tool calling (also known as function calling), structured outputs, and JSON mode. See the table below for per-model coverage of reasoning and vision. For reasoning-specific configuration, see Reasoning. For image and video inputs, see Vision.
GLM models and Nemotron Super also support top_p and top_k sampling parameters.
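To sketch what tool calling and JSON mode look like in a request, the snippet below builds two OpenAI-style Chat Completions payloads: one with a tool definition and one with JSON mode enabled. The `get_weather` tool and its parameters are hypothetical, for illustration only; the payload shapes follow the Chat Completions API.

```python
# A request with a hypothetical weather-lookup tool attached.
tool_payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

# A request with JSON mode, which constrains output to valid JSON.
json_payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "List three sorting algorithms as JSON."}],
    "response_format": {"type": "json_object"},
}
```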

Run inference

Model APIs support both OpenAI’s Chat Completions and Anthropic’s Messages APIs. Set your base URL, API key, and model name to start making requests.

Use the OpenAI SDK

Call supported models using the OpenAI Chat Completions API at https://inference.baseten.co/v1/chat/completions.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
Replace the model slug with any model from the supported models table.
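Because the Chat Completions API is stateless, each request must carry the full conversation history, as the message list above does. A minimal sketch of maintaining that history between turns (generic bookkeeping, not a Baseten-specific API):

```python
def append_turn(messages, assistant_text, user_text):
    """Extend a Chat Completions message list with the model's last reply
    and the user's next question, returning a new list."""
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": user_text},
    ]

history = [
    {"role": "system", "content": "You are a concise technical writer."},
    {"role": "user", "content": "What is gradient descent?"},
]
# After each response arrives, fold the reply back in before the next request:
history = append_turn(
    history,
    "An optimization algorithm that iteratively adjusts model parameters.",
    "How does the learning rate affect it?",
)
print(len(history))  # the full 4-message history travels with the next request
```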

Use the Anthropic SDK

Call supported models using the Anthropic Messages API at https://inference.baseten.co/v1/messages.
import anthropic
import os

API_KEY = os.environ["BASETEN_API_KEY"]

client = anthropic.Anthropic(
    base_url="https://inference.baseten.co",
    api_key=API_KEY,
    default_headers={"Authorization": f"Bearer {API_KEY}"},
)

response = client.messages.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    max_tokens=4096,
    system="You are a concise technical writer.",
    messages=[
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

for block in response.content:
    if block.type == "text":
        print(block.text)
The Anthropic SDK sends the API key as x-api-key by default. Baseten reads Authorization, so override default_headers as shown.
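Note the structural differences from the Chat Completions API: the system prompt is a top-level `system` parameter rather than a message, and `max_tokens` is required. A small sketch of converting an OpenAI-style message list into Messages API keyword arguments (generic restructuring, not an official Baseten or Anthropic utility):

```python
def to_messages_kwargs(openai_messages, max_tokens=4096):
    """Split system messages into Anthropic's top-level `system` parameter
    and keep the user/assistant turns as the `messages` list."""
    system_parts = [m["content"] for m in openai_messages if m["role"] == "system"]
    turns = [m for m in openai_messages if m["role"] != "system"]
    kwargs = {"max_tokens": max_tokens, "messages": turns}
    if system_parts:
        kwargs["system"] = "\n".join(system_parts)
    return kwargs

kwargs = to_messages_kwargs([
    {"role": "system", "content": "You are a concise technical writer."},
    {"role": "user", "content": "What is gradient descent?"},
])
# Then call: client.messages.create(model="...", **kwargs)
```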

List available models

Query the /v1/models endpoint for the current list of models with metadata including pricing, context lengths, and supported features.
curl https://inference.baseten.co/v1/models \
  -H "Authorization: Bearer $BASETEN_API_KEY"
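Once you have the response, you can filter it in code, for example to find models with a given capability. The sample payload and the `supports_vision` field below are assumptions about the response shape, made for illustration; inspect the actual `/v1/models` payload for the real metadata fields.

```python
# Hypothetical excerpt of a /v1/models response. The field names here
# (e.g. "supports_vision") are assumptions -- check the real payload.
sample = {
    "data": [
        {"id": "model-a", "supports_vision": True},
        {"id": "model-b", "supports_vision": False},
    ]
}

def models_with(payload, flag):
    """Return ids of models whose metadata sets the given feature flag."""
    return [m["id"] for m in payload.get("data", []) if m.get(flag)]

print(models_with(sample, "supports_vision"))  # → ['model-a']
```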

Migrate

To migrate to Baseten, change the base URL, API key, and model name.
  1. Replace your OpenAI API key with a Baseten API key.
  2. Change the base URL to https://inference.baseten.co/v1.
  3. Update the model name to a Baseten model slug.
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",  # step 2: Baseten base URL
    api_key=os.environ["BASETEN_API_KEY"],       # step 1: Baseten API key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # step 3: Baseten model slug
    messages=[{"role": "user", "content": "Hello"}],
)

Handle errors

Model APIs return standard HTTP error codes:
Code  Meaning
400   Invalid request (check your parameters)
401   Invalid or missing API key
402   Payment required
404   Model not found
429   Rate limit exceeded
500   Internal server error
Each error response includes a JSON body with details about the issue and suggested resolutions.
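A common pattern for 429 (and transient 500) responses is retry with exponential backoff. The sketch below wraps any request callable; the error-detection hook is generic (it inspects a `status_code` attribute on the raised exception), so adapt it to whichever SDK's exception types you actually use.

```python
import time

# HTTP status codes worth retrying; 4xx errors like 400/401/404 are not
# transient, so they are re-raised immediately.
RETRYABLE = {429, 500}

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Invoke `call()`, retrying retryable HTTP errors with exponential
    backoff (1s, 2s, 4s, ...). `sleep` is injectable for testing."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Usage: `with_retries(lambda: client.chat.completions.create(...))`. On the final attempt the exception propagates, so callers still see persistent failures.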

Next steps

Reasoning

Control extended thinking for complex tasks

Vision

Send images and videos alongside text

Rate limits

Understand and configure rate limits

API reference

Complete parameter documentation