OpenAI-compatible endpoints for high-performance LLMs
Model APIs provide instant access to high-performance LLMs through endpoints compatible with both the OpenAI Chat Completions API and the Anthropic Messages API. Point your existing OpenAI or Anthropic SDK at Baseten’s inference endpoint and start making calls; no model deployment is required.

Unlike self-deployed models, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn’t in the supported list, or want dedicated GPUs with custom scaling, deploy your own model with Truss and call it through the predict API.
Pricing is per million tokens. Every request participates in caching
automatically — there are no flags to set. When a request’s prefix matches a
previously-cached prefix, those tokens are billed at the cache input rate; all
other input tokens are billed at the uncached input rate.
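To make the billing rule concrete, here is a minimal sketch of how input cost breaks down between cached and uncached tokens. The rates below are placeholders for illustration only; actual per-model rates are listed on the pricing page.

```python
# Hypothetical per-million-token rates, for illustration only.
# Real rates vary by model; check the pricing page.
UNCACHED_INPUT_RATE = 0.50  # $ per 1M uncached input tokens (assumed)
CACHED_INPUT_RATE = 0.05    # $ per 1M cache-hit input tokens (assumed)

def input_cost(total_input_tokens: int, cached_prefix_tokens: int) -> float:
    """Bill the cache-hit prefix at the cache rate, the remainder at the uncached rate."""
    uncached = total_input_tokens - cached_prefix_tokens
    return (
        cached_prefix_tokens * CACHED_INPUT_RATE + uncached * UNCACHED_INPUT_RATE
    ) / 1_000_000

# A 10,000-token prompt where the first 8,000 tokens match a cached prefix:
cost = input_cost(10_000, 8_000)
```

Because caching is automatic, the only thing that changes this split is how much of your prompt's prefix repeats across requests, such as a shared system prompt.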
```python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"},
    ],
)

print(response.choices[0].message.content)
```
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V4-Pro",
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "What is gradient descent?" },
    { role: "assistant", content: "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function." },
    { role: "user", content: "How does the learning rate affect it?" },
  ],
});

console.log(response.choices[0].message.content);
```
```shell
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [
      {"role": "system", "content": "You are a concise technical writer."},
      {"role": "user", "content": "What is gradient descent?"},
      {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
      {"role": "user", "content": "How does the learning rate affect it?"}
    ]
  }'
```
Replace the model slug with any model from the supported models table.