If an AI lab has given you a federated API key for their model, this guide shows you how to call that model through the Baseten Frontier Gateway. The gateway is OpenAI-compatible, so any OpenAI SDK or HTTP client works with two changes: the base URL and the auth header.
The gateway accepts the OpenAI Chat Completions API. If your code already targets OpenAI, point the base URL at Baseten and swap the key. No other changes required.

Base URL

The default base URL is:
https://inference.baseten.co/v1
If your lab uses a branded domain for the gateway (for example, https://api.your-lab.com/v1), use that URL instead. Your lab will tell you which URL to use; the request shape is the same.

Authentication

Pass your federated API key in the Authorization header using the Api-Key scheme, not Bearer:
Authorization: Api-Key YOUR_API_KEY
If your client defaults to Authorization: Bearer ..., override it. Federated keys sent as Bearer tokens are rejected. The key was issued to you by your lab through Baseten’s federated key management. You don’t manage rotation or limits; those are configured on the lab’s side. Treat the key like any other API secret: store it in an environment variable or secret manager, never in source control.
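
A minimal sketch of building the header in Python, reading the key from an environment variable (`FEDERATED_API_KEY` is an illustrative name, not one the gateway requires):

```python
import os

# Read the federated key from the environment rather than hard-coding it.
api_key = os.environ.get("FEDERATED_API_KEY", "YOUR_API_KEY")

# The gateway expects the Api-Key scheme, not Bearer.
headers = {
    "Authorization": f"Api-Key {api_key}",
    "Content-Type": "application/json",
}
```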

OpenAI SDK example

Make a chat completion request with the federated key your lab gave you. Replace YOUR_API_KEY with that key and your-org/your-model with the model slug your key is authorized for.
Install the OpenAI SDK:
pip install openai
Make a chat completion request:
chat.py
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key="YOUR_API_KEY",
    # The SDK sends "Authorization: Bearer ..." by default; override it
    # with the Api-Key scheme the gateway expects.
    default_headers={"Authorization": "Api-Key YOUR_API_KEY"},
)

response = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

print(response.choices[0].message.content)
The response follows the standard OpenAI Chat Completions schema:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "your-org/your-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}
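
Reading fields out of a raw response in Python, using the sample payload above (the values are the illustrative ones shown):

```python
import json

# The sample response payload from above, as a raw JSON string.
raw = """
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "your-org/your-model",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 9, "total_tokens": 19}
}
"""

payload = json.loads(raw)
reply = payload["choices"][0]["message"]["content"]
total_tokens = payload["usage"]["total_tokens"]
```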

curl example

For raw HTTP usage:
curl --request POST \
  --url https://inference.baseten.co/v1/chat/completions \
  --header "Content-Type: application/json" \
  --header "Authorization: Api-Key YOUR_API_KEY" \
  --data '{
    "model": "your-org/your-model",
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ]
  }'
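
The same request can be made from Python without the SDK; a sketch using the requests library (the function name and parameters are illustrative):

```python
import requests

def chat_completion(api_key: str, model: str, prompt: str) -> str:
    """Send one chat completion request to the gateway and return the reply text."""
    resp = requests.post(
        "https://inference.baseten.co/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            # Api-Key scheme, not Bearer.
            "Authorization": f"Api-Key {api_key}",
        },
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```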

Model slug format

Model slugs are formatted as your-org/your-model (for example, acme/llama-3-70b). Pass the slug as the model parameter on every request. Your lab will tell you which slug or slugs your key has access to; a single key can be authorized for one or more models.
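
A cheap client-side sanity check on the slug format (illustrative helper, not part of the gateway API):

```python
def looks_like_model_slug(slug: str) -> bool:
    """Check for exactly one "/" separating non-empty org and model parts."""
    parts = slug.split("/")
    return len(parts) == 2 and all(parts)
```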

Streaming, structured outputs, and tool calling

The gateway supports streaming, JSON-schema structured outputs, and tool calling through the standard OpenAI parameters (stream, response_format, tools). The configuration and usage patterns are identical to any OpenAI-compatible endpoint:
  • Streaming responses: see Streaming.
  • JSON-schema structured generation: see Structured outputs.
  • Tool calling and function definitions: see Function calling.
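
As an illustration, a streaming loop might look like the following sketch. stream_reply and its parameters are illustrative names; client is an OpenAI client configured as in the SDK example above:

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Stream a chat completion, print tokens as they arrive, return the full text."""
    text = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the server sends incremental chunks
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no content (e.g. the final one)
            print(delta, end="", flush=True)
            text.append(delta)
    return "".join(text)
```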

Rate limits

Your federated key has rate and usage limits set by your lab. When a limit is exceeded, the gateway returns 429 Too Many Requests. For more information on the limit shape, daily reset behavior, and 429 handling, see Rate and usage limits.
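
A common client-side pattern is to back off and retry on 429. This sketch is illustrative (with_retries and its parameters are hypothetical names; see Rate and usage limits for the gateway's actual limit shape):

```python
import random
import time

def with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() (returning (status, body)); retry on HTTP 429 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Back off 1s, 2s, 4s, ... plus jitter before retrying.
        sleep(2 ** attempt + random.random())
    return status, body
```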

Run a lab serving a model?

If you’re the lab issuing federated keys (rather than a developer consuming them), the Frontier Gateway overview covers federated user and key management, rate limits, and billing webhooks.