# Call your model

> Run inference on deployed models

This page covers calling self-deployed models with your workspace API key. For hosted open-source models with no deployment step, see [Model APIs](/inference/model-apis/overview). If you've been issued a federated key by an AI lab and want to call their model through Baseten Frontier Gateway, see [Calling your model through Frontier Gateway](/frontier-gateway/calling-your-model).

Once deployed, your model is accessible through an [API endpoint](/reference/inference-api/overview). To make an inference request, you'll need:

* **Model ID**: Found in the Baseten dashboard or returned when you deploy.
* **[API key](/organization/api-keys)**: Authenticates your requests.
* **JSON-serializable model input**: The data your model expects.

## Authentication

Include your API key in the `Authorization` header:

```sh theme={"system"}
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/predict \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!"}'
```

In Python, using the `requests` library:

```python theme={"system"}
import requests
import os

api_key = os.environ["BASETEN_API_KEY"]
model_id = "YOUR_MODEL_ID"

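response = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"prompt": "Hello, world!"},  # serialized to JSON; also sets Content-Type
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

print(response.json())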
```

<Note>
  Baseten also accepts `Authorization: Bearer <api_key>` for compatibility with OpenAI-style clients and AI gateways such as LiteLLM and OpenRouter. Both formats work on every Baseten endpoint.

  ```sh theme={"system"}
  curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/predict \
    -H "Authorization: Bearer $BASETEN_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello, world!"}'
  ```
</Note>

## Predict API endpoints

Baseten provides multiple endpoints for different inference modes:

* [`/predict`](/reference/inference-api/overview#predict-endpoints): Standard synchronous inference.
* [`/async_predict`](/reference/inference-api/overview#predict-endpoints): Asynchronous inference for long-running tasks.

Both endpoints are available for each environment and for every individual deployment. See the [API reference](/reference/inference-api/overview) for details.
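As a minimal sketch, the two inference modes differ only in the final path segment of the environment URL. The helper below is hypothetical (`endpoint_url` is not part of any Baseten SDK), and it assumes that `/async_predict` follows the same URL pattern as the `/predict` example above. Check the API reference for the exact request body that asynchronous inference expects.

```python
def endpoint_url(model_id: str, mode: str = "predict", environment: str = "production") -> str:
    """Build an environment endpoint URL (illustrative helper, not a Baseten API)."""
    if mode not in ("predict", "async_predict"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # Assumption: both modes share the /environments/{environment}/ prefix.
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/{mode}"


print(endpoint_url("YOUR_MODEL_ID"))
print(endpoint_url("YOUR_MODEL_ID", mode="async_predict"))
```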

## Sync API endpoints

Custom servers support both `predict` endpoints as well as a `sync` endpoint, which forwards requests to any route on your custom server.

```
https://model-{model_id}.api.baseten.co/environments/production/sync/{route}
```

The following examples show how `sync` URLs map to routes on the custom server:

* `https://model-{model_id}.../sync/health` -> `/health`
* `https://model-{model_id}.../sync/items` -> `/items`
* `https://model-{model_id}.../sync/items/123` -> `/items/123`
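The mapping above can be sketched as a small URL builder. `sync_url` is a hypothetical helper name, and the `production` default mirrors the environment used elsewhere on this page. Everything after `/sync/` is treated as the route on the custom server.

```python
def sync_url(model_id: str, route: str, environment: str = "production") -> str:
    """Return the sync endpoint URL for a custom-server route (illustrative helper)."""
    # Accept routes written with or without a leading slash.
    route = route.lstrip("/")
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/{route}"


for route in ("/health", "/items", "/items/123"):
    print(route, "<-", sync_url("YOUR_MODEL_ID", route))
```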

## OpenAI SDK

When you deploy a model with Engine-Builder, you get an OpenAI-compatible server. If you already use one of the OpenAI SDKs, update the base URL to your Baseten model URL and pass your Baseten API key.

```python theme={"system"}
import os
from openai import OpenAI

model_id = "abcdef"  # TODO: replace with your model ID
api_key = os.environ["BASETEN_API_KEY"]  # fails fast if the key is not set
model_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

client = OpenAI(
    base_url=model_url,
    api_key=api_key,
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # must match --served-model-name in the deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices or no text delta.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

## External LLM gateways

Any LLM gateway that speaks the OpenAI protocol, such as LiteLLM or OpenRouter, can route traffic to a Baseten deployment. Configure the gateway with three values:

* **Base URL**: `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`, using the model ID for your deployment. Click **API endpoint** on the model page in the Baseten dashboard to copy the full URL.
* **Model name**: The value of `--served-model-name` from your deployment's `start_command`. See the [vLLM example](/examples/vllm) for where this is set. When a single gateway routes to several deployments, use an `org/model` naming convention (for example, `acme/llama-3-70b`) to keep routing unambiguous.
* **API key**: A [Baseten API key](/organization/api-keys) with access to the deployment.

The gateway sends requests to `{base_url}/chat/completions` with `model` set to the served model name and an `Authorization` header carrying the key, either `Api-Key <key>` or the `Bearer <key>` form that OpenAI-style clients send by default.
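To make the shape of that request concrete, here is a rough sketch of what a gateway assembles from the three values. `build_chat_request` and its parameters are hypothetical names chosen for illustration, not part of LiteLLM, OpenRouter, or Baseten. The body follows the OpenAI chat completions format, and the sketch only builds the request without sending it.

```python
import os


def build_chat_request(base_url, served_model_name, api_key, user_message, scheme="Api-Key"):
    """Assemble the URL, headers, and JSON body for a chat completions call (illustrative)."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    headers = {
        # Baseten accepts both "Api-Key" and "Bearer" as the authorization scheme.
        "Authorization": f"{scheme} {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": served_model_name,  # must match --served-model-name
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, headers, body


url, headers, body = build_chat_request(
    base_url="https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1",
    served_model_name="acme/llama-3-70b",
    api_key=os.environ.get("BASETEN_API_KEY", "YOUR_API_KEY"),
    user_message="What is the capital of France?",
)
print(url)
```

Passing `scheme="Bearer"` produces the header that OpenAI-style clients send by default.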

## Alternative invocation methods

* **Truss CLI**: [`truss predict`](/reference/cli/truss/predict)
* **Model Dashboard**: "Playground" button in the Baseten UI
