In Frontier Gateway, rate and usage limits live on the federated user, not on individual API keys. Every key minted under a user inherits the user’s per-model limits, so rotating a customer’s credentials doesn’t change what they can spend.

Rate limits cap short-window throughput (per second or per minute), and usage limits cap total consumption per daily window. Both are scoped to a single (user, model slug) pair, so a user can carry separate limits for every model their keys are allowed to call. You configure both kinds of limit by passing them inside models[].rate_limits and models[].usage_limits when you call POST /v1/gateway/users.

Workspace API keys and the shared Model APIs product use a different limit model. For a comparison, see Frontier Gateway versus Model APIs limits.
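As a rough illustration of the call described above, the following Python sketch builds (but does not send) a POST /v1/gateway/users request that carries both kinds of limit. The host in `BASE_URL`, the `Authorization` header format, and the helper name `build_create_user_request` are assumptions made for this example; only the path and the payload fields come from this page.

```python
import json
import urllib.request

# Assumption: placeholder host. This page only specifies the path.
BASE_URL = "https://api.example.invalid"


def build_create_user_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build, without sending, a POST /v1/gateway/users request (hypothetical helper)."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/gateway/users",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the auth header shown here is illustrative only.
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Limits are nested under each model slug, not under a key.
payload = {
    "customer_id": "cust_42",
    "models": [
        {
            "slug": "your-org/your-model",
            "rate_limits": [
                {"type": "REQUEST", "unit": "MINUTE", "threshold": 100}
            ],
            "usage_limits": [
                {"type": "TOKEN", "unit": "DAY", "threshold": 10000000}
            ],
        }
    ],
}

request = build_create_user_request("YOUR_API_KEY", payload)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any traffic reaches the API.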

Rate limits

A rate limit caps short-window throughput. You attach one or more rate limits to each model slug on a federated user.
Field | Values | Description
type | TOKEN, REQUEST | Whether the limit counts tokens (prompt plus completion) or requests.
unit | SECOND, MINUTE | The window the threshold applies to.
threshold | Integer >= 1 | The maximum count allowed per window.
You can set both a TOKEN and a REQUEST rate limit on the same model slug, but you can’t set two rate limits with the same type.
{
  "customer_id": "cust_42",
  "models": [
    {
      "slug": "your-org/your-model",
      "rate_limits": [
        { "type": "TOKEN", "unit": "MINUTE", "threshold": 1000000 },
        { "type": "REQUEST", "unit": "MINUTE", "threshold": 100 }
      ]
    }
  ]
}
In this example, the user can spend up to one million prompt-plus-completion tokens per minute on your-org/your-model, and up to 100 requests per minute against the same model. Both ceilings are enforced; whichever the caller hits first triggers a 429 Too Many Requests response.
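The field rules above can be checked client-side before a payload is sent. The sketch below is a minimal validator under the constraints stated in this section (allowed types and units, threshold of at least 1, no two rate limits with the same type). The function name `validate_rate_limits` and the error message wording are hypothetical, and the API's own validation may differ.

```python
ALLOWED_TYPES = {"TOKEN", "REQUEST"}
ALLOWED_UNITS = {"SECOND", "MINUTE"}


def validate_rate_limits(rate_limits: list) -> list:
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = []
    seen_types = set()
    for i, limit in enumerate(rate_limits):
        limit_type = limit.get("type")
        unit = limit.get("unit")
        threshold = limit.get("threshold")

        if limit_type not in ALLOWED_TYPES:
            errors.append(f"rate_limits[{i}]: type must be TOKEN or REQUEST")
        elif limit_type in seen_types:
            # Two limits with the same type on one slug are not allowed.
            errors.append(f"rate_limits[{i}]: duplicate type {limit_type}")
        else:
            seen_types.add(limit_type)

        if unit not in ALLOWED_UNITS:
            errors.append(f"rate_limits[{i}]: unit must be SECOND or MINUTE")

        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(threshold, int) and not isinstance(threshold, bool)
        if not is_int or threshold < 1:
            errors.append(f"rate_limits[{i}]: threshold must be an integer >= 1")
    return errors


example = [
    {"type": "TOKEN", "unit": "MINUTE", "threshold": 1000000},
    {"type": "REQUEST", "unit": "MINUTE", "threshold": 100},
]
problems = validate_rate_limits(example)
```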

Usage limits

A usage limit caps how much a user can spend in a daily window. Usage limits are optional. You can attach a usage limit to any model slug the user is allowed to call.
Field | Values | Description
type | TOKEN, REQUEST | Whether the limit counts tokens or requests.
unit | DAY | The window the threshold applies to. Daily is the only supported window.
threshold | Integer >= 1 | The maximum count allowed per daily window.
Both TOKEN and REQUEST are supported as the type for a usage limit:
{
  "customer_id": "cust_42",
  "models": [
    {
      "slug": "your-org/your-model",
      "usage_limits": [
        { "type": "TOKEN", "unit": "DAY", "threshold": 10000000 },
        { "type": "REQUEST", "unit": "DAY", "threshold": 5000 }
      ]
    }
  ]
}
In this example, the user can spend up to ten million tokens per day and up to 5,000 requests per day on your-org/your-model. Whichever ceiling the caller hits first triggers a 429 Too Many Requests response for the rest of the daily window.
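To see which of two daily ceilings is likely to bind, it can help to work the arithmetic. The sketch below assumes every request consumes the same number of tokens, which real traffic will not; the function name `first_daily_limit_hit` is hypothetical, and actual metering is done by the platform, not by this calculation.

```python
def first_daily_limit_hit(token_threshold: int, request_threshold: int,
                          tokens_per_request: int) -> tuple:
    """Estimate which daily limit is exhausted first for uniform-size requests.

    Returns (limit_name, approximate_requests_before_that_limit_is_reached).
    """
    # How many uniform requests fit inside the token budget.
    requests_within_token_cap = token_threshold // tokens_per_request
    if requests_within_token_cap < request_threshold:
        return ("TOKEN", requests_within_token_cap)
    if requests_within_token_cap > request_threshold:
        return ("REQUEST", request_threshold)
    return ("BOTH", request_threshold)


# Thresholds from the example above: 10,000,000 tokens and 5,000 requests per day.
large_requests = first_daily_limit_hit(10000000, 5000, 4000)
small_requests = first_daily_limit_hit(10000000, 5000, 500)
```

With these thresholds the break-even point is 2,000 tokens per request: heavier requests exhaust the token limit first, lighter ones exhaust the request limit first.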

Per-model scope

Limits are scoped per (user, model slug) pair. A federated user can be authorized for multiple model slugs, and each slug carries its own independent rate-limit and usage-limit buckets. Spending tokens against one model doesn’t draw down another model’s budget for the same user.
{
  "customer_id": "cust_42",
  "models": [
    {
      "slug": "your-org/your-model",
      "rate_limits": [
        { "type": "TOKEN", "unit": "MINUTE", "threshold": 1000000 }
      ],
      "usage_limits": [
        { "type": "TOKEN", "unit": "DAY", "threshold": 10000000 }
      ]
    },
    {
      "slug": "your-org/your-other-model",
      "rate_limits": [
        { "type": "REQUEST", "unit": "SECOND", "threshold": 20 }
      ]
    }
  ]
}
In this example, your-org/your-model carries a per-minute token rate limit and a daily token usage limit, while your-org/your-other-model carries only a per-second request rate limit. The two slugs are independent.
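The independence of the buckets can be pictured with a toy in-memory ledger keyed by (user, model slug). This is only a mental model, not the platform's implementation: the class `UsageLedger`, its method names, and the choice to reject a request that would cross the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


class UsageLedger:
    """Toy model of daily token budgets keyed by (user, model slug)."""

    def __init__(self, daily_token_limits: dict):
        # Maps (user, slug) -> daily token threshold. A missing key means
        # no usage limit is configured for that pair.
        self.limits = daily_token_limits
        self.used = defaultdict(int)

    def try_spend(self, user: str, slug: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the limit."""
        key = (user, slug)
        limit = self.limits.get(key)
        if limit is not None and self.used[key] + tokens > limit:
            return False  # the real platform would answer 429 here
        self.used[key] += tokens
        return True


ledger = UsageLedger({("cust_42", "your-org/your-model"): 10000000})

# Exhaust the budget on the first model.
first_ok = ledger.try_spend("cust_42", "your-org/your-model", 10000000)
over_budget = ledger.try_spend("cust_42", "your-org/your-model", 1)

# The second model has its own bucket (and no usage limit in this example).
other_ok = ledger.try_spend("cust_42", "your-org/your-other-model", 500000)
```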

Enforcement and reset

Rate-limit enforcement, daily usage windows, and the midnight UTC reset are Baseten platform behavior, not Frontier Gateway features. Frontier Gateway attaches the limits to your federated users; the underlying platform meters traffic against them and behaves the same way for other Baseten products that consume the same primitives.

When a request from one of a federated user’s keys exceeds any of the user’s configured limits on a given model slug, the platform rejects the request with 429 Too Many Requests. The 429 fires for the first limit hit: if a user has a TOKEN/MINUTE rate limit and a REQUEST/DAY usage limit, either can trigger rejection.

Daily usage windows reset at midnight UTC. After reset, a user’s consumption for each DAY limit returns to zero and they can spend up to the threshold again over the next 24 hours. Rate-limit windows (per second, per minute) are short rolling windows enforced inline on every request and don’t have a reset timestamp you need to track.
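If a client wants to know when an exhausted daily limit will open up again, the midnight UTC reset can be computed locally. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_daily_reset(now: datetime) -> datetime:
    """Return the next midnight UTC strictly after `now` (must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


def seconds_until_reset(now: datetime) -> float:
    """Seconds remaining until daily usage windows reset."""
    return (next_daily_reset(now) - now.astimezone(timezone.utc)).total_seconds()


# Example: one hour before midnight UTC.
sample_now = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
reset_at = next_daily_reset(sample_now)
wait_seconds = seconds_until_reset(sample_now)
```

A client that receives a 429 caused by a DAY limit could use `seconds_until_reset(datetime.now(timezone.utc))` to decide how long to pause that user's traffic. For rate limits, a short retry delay is usually more appropriate, since those windows have no reset timestamp to track.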

Frontier Gateway versus Model APIs limits

Frontier Gateway and the shared Model APIs product use different limit models:
  • Frontier Gateway limits are per federated user, per model slug. You configure TOKEN/REQUEST rate limits (SECOND or MINUTE) and optional TOKEN/REQUEST usage limits (DAY) on the user, and every key minted under the user inherits them.
  • Model APIs limits are account-tier RPM/TPM ceilings that apply to your workspace API key as a whole, regardless of which Model APIs model you’re calling.
For more information on Model APIs limits, see Rate limits and budgets.
