Inference on Baseten is the path from your application to a model running on Baseten's infrastructure, whether you use Model APIs for hosted models or deploy your own with Truss. You don't provision GPUs or build your own routing layer: Baseten authenticates each request, matches it to a deployment environment, and runs it on a replica. The sections below cover choosing an API, how the response is delivered (synchronous, streaming, or asynchronous), structured outputs and tool calling, and client tuning; they assume you already have a Baseten account and an API key.

To call popular open models without a Truss project, point the OpenAI SDK at the public OpenAI-compatible endpoint, https://inference.baseten.co/v1, with your Baseten API key. The Model APIs documentation lists models, pricing, and feature support. For what happens after the gateway (routing, replicas, queuing, retries, cold starts), see Request lifecycle.
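A minimal sketch of that call, assuming the OpenAI Python SDK is installed and `BASETEN_API_KEY` is set in your environment; the model slug below is a placeholder to swap for one listed in the Model APIs documentation:

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at Baseten's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

# Placeholder model slug; use any slug from the Model APIs documentation.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Name three prime numbers."}],
)
print(response.choices[0].message.content)

# Streaming variant: the same call with stream=True yields chunks as
# tokens are generated, instead of one synchronous response.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Because the endpoint is OpenAI-compatible, any existing OpenAI SDK code can be repointed by changing only the base URL and API key.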
If you’re an AI lab serving your own hosted model to your own customers under a branded URL, with federated keys and per-customer billing, see Frontier Gateway.
Inference API
When you deploy your own model, pick an interface that matches your payloads. Engine-Builder-LLM, BIS-LLM, and BEI expose /v1/chat/completions (or /v1/embeddings for BEI) on https://inference.baseten.co/v1 with OpenAI-compatible parameters for structured outputs, tool calling, reasoning, and streaming. Custom Truss code can use /predict for arbitrary JSON when chat or embeddings are not a good fit. See the Inference API reference for paths, methods, and errors.
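For a custom Truss deployment on /predict, a plain HTTP POST is enough. This is a sketch only: the model ID, environment path, and payload shape are placeholders, and the exact URL format and error codes are what the Inference API reference documents:

```python
import os

import requests

# Placeholder model ID and environment path; copy the real endpoint for
# your deployment from the Baseten dashboard or the Inference API reference.
url = "https://model-abcd1234.api.baseten.co/environments/production/predict"

resp = requests.post(
    url,
    # Baseten API keys are sent as an "Api-Key" authorization header.
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    # Arbitrary JSON: whatever your Truss model's predict() expects.
    json={"prompt": "Summarize this sentence."},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```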