Model APIs provide instant access to high-performance LLMs through endpoints that are compatible with both the OpenAI Chat Completions API and the Anthropic Messages API. Point your existing OpenAI or Anthropic SDK at Baseten's inference endpoint and start making calls, no model deployment required.

Unlike dedicated deployments, where you'd configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own with Truss.
Model APIs bill per million tokens.
For current per-model rates, see the Model APIs pricing page.

Cached input tokens are prompt tokens served from the KV cache, billed at a discounted rate.
Every request participates in caching automatically, with no flags or opt-in steps.
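As a rough sketch, you can check how much of a given prompt was served from cache by inspecting the usage object on the response, via the Chat Completions endpoint described below. This assumes the endpoint returns OpenAI-style usage details; the prompt_tokens_details.cached_tokens field is an assumption and may not be populated for every model.

# Sketch: inspect token usage to see how much of the prompt was cached.
# prompt_tokens_details.cached_tokens is assumed, not confirmed by this page.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "What is gradient descent?"}],
)

usage = response.usage
print("prompt tokens:", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)

details = getattr(usage, "prompt_tokens_details", None)
if details is not None and details.cached_tokens is not None:
    print("cached prompt tokens:", details.cached_tokens)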
All models support tool calling (also known as function calling), structured outputs, and JSON mode. See the table below for per-model coverage of reasoning and vision. For reasoning-specific configuration, see Reasoning. For image and video inputs, see Vision.
GLM models and Nemotron Super also support top_p and top_k sampling parameters.
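As a minimal sketch of tool calling through the OpenAI-compatible Chat Completions endpoint covered in the next section: the get_weather tool below is a hypothetical example you would define yourself, not a built-in.

# Sketch: tool calling (function calling) via the Chat Completions endpoint.
# get_weather is a made-up tool definition for illustration.
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chooses to call a tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))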
Call supported models using the OpenAI Chat Completions API at https://inference.baseten.co/v1/chat/completions.
Python
JavaScript
cURL
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

print(response.choices[0].message.content)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V4-Pro",
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "What is gradient descent?" },
    { role: "assistant", content: "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function." },
    { role: "user", content: "How does the learning rate affect it?" }
  ],
});

console.log(response.choices[0].message.content);
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [
      {"role": "system", "content": "You are a concise technical writer."},
      {"role": "user", "content": "What is gradient descent?"},
      {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
      {"role": "user", "content": "How does the learning rate affect it?"}
    ]
  }'
Replace the model slug with any model from the supported models table.
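Streaming isn't shown in the examples above. Assuming the endpoint supports standard OpenAI-style streaming with stream=True (an assumption, not confirmed on this page), a minimal sketch looks like this:

# Sketch: stream tokens from the Chat Completions endpoint as they are generated.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Explain the learning rate in one paragraph."}],
    stream=True,
)

# Each chunk carries a delta with the next slice of generated text.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)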
Call supported models using the Anthropic Messages API at https://inference.baseten.co/v1/messages.
Python
JavaScript
cURL
import anthropic
import os

API_KEY = os.environ["BASETEN_API_KEY"]

client = anthropic.Anthropic(
    base_url="https://inference.baseten.co",
    api_key=API_KEY,
    default_headers={"Authorization": f"Bearer {API_KEY}"},
)

response = client.messages.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    max_tokens=4096,
    system="You are a concise technical writer.",
    messages=[
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"}
    ],
)

for block in response.content:
    if block.type == "text":
        print(block.text)
import Anthropic from "@anthropic-ai/sdk";

const apiKey = process.env.BASETEN_API_KEY;

const client = new Anthropic({
  baseURL: "https://inference.baseten.co",
  apiKey: apiKey,
  defaultHeaders: { Authorization: `Bearer ${apiKey}` },
});

const response = await client.messages.create({
  model: "deepseek-ai/DeepSeek-V4-Pro",
  max_tokens: 4096,
  system: "You are a concise technical writer.",
  messages: [
    { role: "user", content: "What is gradient descent?" },
    { role: "assistant", content: "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function." },
    { role: "user", content: "How does the learning rate affect it?" }
  ],
});

for (const block of response.content) {
  if (block.type === "text") console.log(block.text);
}
curl https://inference.baseten.co/v1/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "max_tokens": 4096,
    "system": "You are a concise technical writer.",
    "messages": [
      {"role": "user", "content": "What is gradient descent?"},
      {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
      {"role": "user", "content": "How does the learning rate affect it?"}
    ]
  }'
The Anthropic SDK sends the API key as x-api-key by default. Baseten reads Authorization, so override default_headers as shown.
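Streaming through the Anthropic SDK isn't covered above. Assuming the Messages endpoint supports Anthropic-style streaming (an assumption, not confirmed on this page), a minimal sketch using the SDK's streaming helper, with the same Authorization override, looks like this:

# Sketch: stream text from the Messages endpoint with the Anthropic SDK.
# Streaming support on this endpoint is assumed, not confirmed by this page.
import os
import anthropic

API_KEY = os.environ["BASETEN_API_KEY"]

client = anthropic.Anthropic(
    base_url="https://inference.baseten.co",
    api_key=API_KEY,
    default_headers={"Authorization": f"Bearer {API_KEY}"},
)

with client.messages.stream(
    model="deepseek-ai/DeepSeek-V4-Pro",
    max_tokens=4096,
    messages=[{"role": "user", "content": "What is gradient descent?"}],
) as stream:
    # text_stream yields each text delta as the model generates it.
    for text in stream.text_stream:
        print(text, end="", flush=True)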