OpenAI-compatible endpoints for high-performance LLMs
Model APIs provide instant access to high-performance LLMs through OpenAI-compatible endpoints. Point your existing OpenAI SDK at Baseten's inference endpoint and start making calls; no model deployment is required.

Unlike self-deployed models, where you configure hardware, engines, and scaling yourself, Model APIs run on shared infrastructure that Baseten manages. You get a fixed set of popular models with optimized serving out of the box. When you need a model that isn't in the supported list, or want dedicated GPUs with custom scaling, deploy your own model with Truss and call it through the predict API.
Pricing is per million tokens. Every request participates in caching automatically; there are no flags to set. When a request's prefix matches a previously cached prefix, those tokens are billed at the cached input rate, and all other input tokens are billed at the uncached input rate.
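As a concrete illustration of how the billing split works, suppose a request has 10,000 input tokens and 6,000 of them match a cached prefix. The rates below are hypothetical placeholders, not Baseten's actual pricing; check the pricing page for real per-model rates:

```python
# Illustrative only: these per-million-token rates are hypothetical,
# not Baseten's actual pricing. See the pricing page for real rates.
UNCACHED_INPUT_RATE = 0.50  # $ per 1M uncached input tokens (assumed)
CACHED_INPUT_RATE = 0.05    # $ per 1M cached input tokens (assumed)

def input_cost(total_input_tokens: int, cached_tokens: int) -> float:
    """Bill cached-prefix tokens at the cache rate, the rest at the uncached rate."""
    uncached_tokens = total_input_tokens - cached_tokens
    return (cached_tokens * CACHED_INPUT_RATE
            + uncached_tokens * UNCACHED_INPUT_RATE) / 1_000_000

# 10,000 input tokens, 6,000 of them matching a cached prefix:
print(f"${input_cost(10_000, 6_000):.6f}")  # $0.002300
```

The same request with a cold cache would bill all 10,000 tokens at the uncached rate, so prompts that share a long common prefix (for example, a fixed system message) get cheaper on repeat calls.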
If you’ve already completed the quickstart, you have a working client. The examples below show a multi-turn conversation with a system message that you can adapt for your application.
Python

```python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "What is gradient descent?"},
        {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
        {"role": "user", "content": "How does the learning rate affect it?"},
    ],
)
print(response.choices[0].message.content)
```

JavaScript

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V3.1",
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "What is gradient descent?" },
    { role: "assistant", content: "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function." },
    { role: "user", content: "How does the learning rate affect it?" },
  ],
});
console.log(response.choices[0].message.content);
```

cURL

```shell
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1",
    "messages": [
      {"role": "system", "content": "You are a concise technical writer."},
      {"role": "user", "content": "What is gradient descent?"},
      {"role": "assistant", "content": "An optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease in the loss function."},
      {"role": "user", "content": "How does the learning rate affect it?"}
    ]
  }'
```
Replace the model slug with any model from the supported models table.