This page covers calling self-deployed models. For hosted open-source models with no deployment step, see Model APIs.

Once deployed, your model is accessible through an API endpoint. To make an inference request, you’ll need:
  • Model ID: Found in the Baseten dashboard or returned when you deploy.
  • API key: Authenticates your requests.
  • JSON-serializable model input: The data your model expects.

Authentication

Include your API key in the Authorization header:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!"}'
In Python with requests:
import requests
import os

api_key = os.environ["BASETEN_API_KEY"]
model_id = "YOUR_MODEL_ID"

response = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"prompt": "Hello, world!"},
)

print(response.json())

Predict API endpoints

Baseten provides multiple endpoints for different inference modes. Endpoints are available both for environments and for individual deployments. See the API reference for details.
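As a sketch of the difference, the snippet below builds an environment-scoped URL and a deployment-scoped URL for the same model. The IDs are hypothetical placeholders, and the deployment-scoped path shown here is an assumption — confirm the exact pattern in the API reference.

```python
import os
import requests

# Hypothetical placeholder IDs; replace with values from your Baseten dashboard.
model_id = os.environ.get("BASETEN_MODEL_ID", "abcdef")
deployment_id = os.environ.get("BASETEN_DEPLOYMENT_ID", "xyz123")

# Environment-scoped endpoint: always targets whatever is promoted to production.
env_url = f"https://model-{model_id}.api.baseten.co/environments/production/predict"

# Deployment-scoped endpoint: pins requests to one specific deployment.
deployment_url = f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/predict"

def predict(url: str, payload: dict) -> dict:
    """POST a JSON payload to a predict endpoint and return the decoded response."""
    resp = requests.post(
        url,
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json=payload,
    )
    resp.raise_for_status()
    return resp.json()

# Only make a live call if an API key is actually configured.
if "BASETEN_API_KEY" in os.environ:
    print(predict(env_url, {"prompt": "Hello, world!"}))
```

Environment URLs are stable across promotions, so they are usually what you wire into applications; deployment URLs are useful for testing a specific deployment before promoting it.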

Sync API endpoints

Custom servers support the standard predict endpoints as well as a special sync endpoint. The sync endpoint lets you call arbitrary routes on your custom server:
https://model-{model_id}.api.baseten.co/environments/production/sync/{route}
Here are a few examples of how the sync endpoint maps to the custom server’s routes:
  • https://model-{model_id}.../sync/health -> /health
  • https://model-{model_id}.../sync/items -> /items
  • https://model-{model_id}.../sync/items/123 -> /items/123
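For instance, the mapping above means a custom server’s /health route is reachable through /sync/health. A minimal sketch, using a hypothetical placeholder model ID:

```python
import os
import requests

# Hypothetical placeholder; replace with your model ID.
model_id = os.environ.get("BASETEN_MODEL_ID", "abcdef")

# /sync/health on the Baseten URL forwards to /health on the custom server.
health_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/health"

# Only make a live call if an API key is actually configured.
if "BASETEN_API_KEY" in os.environ:
    resp = requests.get(
        health_url,
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    )
    print(resp.status_code, resp.text)
```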

OpenAI SDK

When you deploy a model with the Engine Builder, you get an OpenAI-compatible server. If you’re already using one of the OpenAI SDKs, simply update the base URL to your Baseten model URL and include your Baseten API key.
import os
from openai import OpenAI

model_id = "abcdef" # TODO: replace with your model id
api_key = os.environ.get("BASETEN_API_KEY")
model_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

client = OpenAI(
    base_url=model_url,
    api_key=api_key,
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # must match --served-model-name in the deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

External LLM gateways

Any LLM gateway that speaks the OpenAI protocol, such as LiteLLM or OpenRouter, can route traffic to a Baseten deployment. Configure the gateway with three values:
  • Base URL: https://model-{model_id}.api.baseten.co/environments/production/sync/v1, using the model ID for your deployment. Click API endpoint on the model page in the Baseten dashboard to copy the full URL.
  • Model name: The value of --served-model-name from your deployment’s start_command. See the vLLM example for where this is set. When a single gateway routes to several deployments, use an org/model naming convention (for example, acme/llama-3-70b) to keep routing unambiguous.
  • API key: A Baseten API key with access to the deployment.
The gateway sends requests to {base_url}/chat/completions with model set to the served model name and an Authorization: Api-Key <key> header.
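The request a gateway sends can be reproduced with a plain HTTP client, which is handy for verifying your three configuration values before wiring up the gateway itself. A sketch, again using a hypothetical placeholder model ID:

```python
import os
import requests

# Hypothetical placeholder; replace with your model ID.
model_id = os.environ.get("BASETEN_MODEL_ID", "abcdef")
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

payload = {
    "model": "Qwen/Qwen2.5-3B-Instruct",  # must match --served-model-name
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

# Only make a live call if an API key is actually configured.
if "BASETEN_API_KEY" in os.environ:
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        json=payload,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```

If this request succeeds, the same base URL, model name, and API key should work unchanged in any OpenAI-protocol gateway.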

Alternative invocation methods

  • Truss CLI: truss predict
  • Model Dashboard: “Playground” button in the Baseten UI