Every model running on Baseten is accessible over HTTPS through the inference API. The API provides two paths depending on how your model is served. Model APIs offer managed, high-performance LLMs through a single OpenAI-compatible endpoint, with no deployment step required. Deployed model endpoints serve custom models and chains that you package and deploy with Truss, each routed through a dedicated subdomain.

Model APIs

Model APIs give you instant access to popular open-source LLMs with optimized serving. Baseten manages the infrastructure — shared GPU clusters, model weights, and serving configuration — so there is no deployment step and nothing to configure. The supported catalog includes models like DeepSeek, GLM, and Kimi, with all models supporting tool calling and most supporting structured outputs. Pricing is per million tokens. Because Model APIs implement the OpenAI chat completions format, switching from OpenAI to Baseten requires only changing the base URL and API key in your existing client. All requests route through a single endpoint:
https://inference.baseten.co/v1/chat/completions
The Chat Completions reference covers request and response schemas. For usage details including structured outputs and tool calling, refer to the Model APIs guide.
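Because the endpoint speaks the OpenAI chat completions format, any OpenAI-style client works once it points at the URL above. Below is a minimal sketch in Python that assembles such a request; the model slug `deepseek-ai/DeepSeek-V3` is an illustrative placeholder, so check the catalog for the exact identifiers available to you:

```python
import os

BASE_URL = "https://inference.baseten.co/v1/chat/completions"

def build_chat_request(model: str, messages: list, api_key: str) -> dict:
    """Assemble keyword arguments for an OpenAI-compatible chat request."""
    return {
        "url": BASE_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {"model": model, "messages": messages},
    }

if __name__ == "__main__":
    req = build_chat_request(
        model="deepseek-ai/DeepSeek-V3",  # placeholder model slug
        messages=[{"role": "user", "content": "Hello!"}],
        api_key=os.environ.get("BASETEN_API_KEY", ""),
    )
    # Send with any HTTP client, e.g.: requests.post(**req)
```

The only Baseten-specific pieces are the base URL and the API key; the payload is the standard chat completions schema.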

Deployed model endpoints

When you deploy a custom model or chain with Truss, Baseten assigns it a dedicated subdomain for routing. This is the path for models that aren’t in the Model APIs catalog — models with custom serving logic, fine-tuned weights, or multi-step inference pipelines built as chains. You control the hardware, scaling behavior, and serving engine. Each endpoint URL includes a deployment target: an environment name like production, the development deployment, or a specific deployment ID. For models:
https://model-{model_id}.api.baseten.co/{deployment_type_or_id}/{endpoint}
For chains:
https://chain-{chain_id}.api.baseten.co/{deployment_type_or_id}/{endpoint}
  • model_id: the model’s alphanumeric ID, found in your model dashboard.
  • chain_id: the chain’s alphanumeric ID, found in your chain dashboard.
  • deployment_type_or_id: either development, production, or a specific deployment’s alphanumeric ID.
  • endpoint: the API action, such as predict.
For long-running tasks, the inference API supports asynchronous inference with priority queuing.
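An async call posts to an `async_predict` path instead of `predict` and returns immediately with a request ID rather than the model output. The payload shape sketched here (a `model_input` wrapper plus an optional `webhook_endpoint` callback field) is an assumption based on common async-webhook patterns, not a confirmed schema; consult the async inference reference for the exact request body:

```python
from typing import Optional

def build_async_predict(model_input: dict, webhook: Optional[str] = None) -> dict:
    """Wrap a model input for an async_predict request body.

    ASSUMPTION: the body nests inputs under "model_input" and accepts an
    optional "webhook_endpoint" to be called when the request completes.
    """
    body = {"model_input": model_input}
    if webhook is not None:
        body["webhook_endpoint"] = webhook
    return body

# "https://example.com/hook" is a placeholder callback URL.
print(build_async_predict({"prompt": "hi"}, "https://example.com/hook"))
```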

Predict endpoints

Method       Endpoint                                         Description
POST         /environments/{env_name}/predict                 Call an environment.
POST         /development/predict                             Call the development deployment.
POST         /deployment/{deployment_id}/predict              Call any deployment.
POST         /environments/{env_name}/async_predict           For async inference, call the deployment associated with the specified environment.
POST         /development/async_predict                       For async inference, call the development deployment.
POST         /deployment/{deployment_id}/async_predict        For async inference, call any published deployment of your model.
WEBSOCKET    /environments/{env_name}/websocket               For WebSockets, connect to an environment.
WEBSOCKET    /development/websocket                           For WebSockets, connect to the development deployment.
WEBSOCKET    /deployment/{deployment_id}/websocket            For WebSockets, connect to a deployment.
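The rows above all follow the same URL shapes. Here is a small helper that assembles them, shown with a placeholder model ID (`abcd1234`):

```python
from typing import Optional

def predict_url(kind: str, resource_id: str, *,
                environment: Optional[str] = None,
                deployment_id: Optional[str] = None,
                action: str = "predict") -> str:
    """Build an inference URL for a model or chain subdomain.

    With neither environment nor deployment_id given, the development
    deployment is targeted. action is e.g. "predict" or "async_predict".
    """
    base = f"https://{kind}-{resource_id}.api.baseten.co"
    if environment is not None:
        return f"{base}/environments/{environment}/{action}"
    if deployment_id is not None:
        return f"{base}/deployment/{deployment_id}/{action}"
    return f"{base}/development/{action}"

# "abcd1234" is a placeholder ID.
print(predict_url("model", "abcd1234", environment="production"))
# https://model-abcd1234.api.baseten.co/environments/production/predict
```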

Async status endpoints

Method       Endpoint                                         Description
GET          /async_request/{request_id}                      Get the status of a model async request.
GET          /async_request/{request_id}                      Get the status of a chain async request (same path, sent to the chain's subdomain).
DELETE       /async_request/{request_id}                      Cancel an async request.
GET          /environments/{env_name}/async_queue_status      Get the async queue status for a model associated with the specified environment.
GET          /development/async_queue_status                  Get the status of the development deployment's async queue.
GET          /deployment/{deployment_id}/async_queue_status   Get the status of a deployment's async queue.
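Checking on an async request is a plain authenticated GET against the same subdomain that accepted it. A hedged sketch using only the standard library; the IDs are placeholders, and the response schema is not shown here, so consult the async inference reference for the fields it contains:

```python
import json
import urllib.request

def status_url(model_id: str, request_id: str) -> str:
    """Build the async status URL for a request made to a model."""
    return f"https://model-{model_id}.api.baseten.co/async_request/{request_id}"

def get_status(model_id: str, request_id: str, api_key: str) -> dict:
    """Fetch and decode the status of an async request."""
    req = urllib.request.Request(
        status_url(model_id, request_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# "abcd1234" and "req-001" are placeholder IDs.
print(status_url("abcd1234", "req-001"))
# https://model-abcd1234.api.baseten.co/async_request/req-001
```

For a chain's request, swap the `model-{model_id}` subdomain for `chain-{chain_id}`; the path is identical.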

Wake endpoints

Method       Endpoint                                         Description
POST         /production/wake                                 Wake the production environment of your model.
POST         /development/wake                                Wake the development deployment of your model.
POST         /deployment/{deployment_id}/wake                 Wake any deployment of your model.
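Waking a scaled-to-zero deployment ahead of traffic is just a POST to the matching wake path. A minimal URL-building sketch with a placeholder model ID:

```python
def wake_url(model_id: str, target: str = "production") -> str:
    """Build a wake URL.

    target is "production", "development", or "deployment/{deployment_id}".
    """
    return f"https://model-{model_id}.api.baseten.co/{target}/wake"

# "abcd1234" is a placeholder model ID.
print(wake_url("abcd1234"))
# https://model-abcd1234.api.baseten.co/production/wake
```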