Inference errors

When an inference request fails, the error message alone often isn’t enough to tell you what to fix. The status code is the first clue. It tells you the broad category, and because your request passes through Baseten’s inference gateway before it reaches your model, it also points at where the failure happened: your request, the Baseten platform, or your model’s own code. Every failed response is JSON, in the form {"error": "<message>"}. This page maps each status code to its likely source and the fastest way to confirm it. Start with the quick reference, then read the section for the status code you received. Streaming and async requests report failures differently. See Streaming and async errors.

How to read an inference error

A failed response has two parts worth reading:

The status code: the broad category, returned in the HTTP response (502, 503, and so on).
The error body: {"error": "<message>"}. The message is either a short string Baseten generates (such as Error making prediction) or, when your model itself returned the error, the raw response your model produced, passed through unchanged.

One distinction resolves most failures: did your model return the error, or was your model unreachable? See Is it my model or Baseten?.

Quick reference

Status	What it usually means	Where to look
`400` / `401` / `403` / `404`	Malformed request, invalid API key, or wrong model ID	Your request: check the endpoint and API key
`402`	Billing or payment issue on the account	Billing and usage
`413`	Request body too large: 100 MB edge cap (all requests), or 256 KiB async default	Send large inputs by file or URL, not inline
`429`	Rate limit exceeded, or no capacity was available for the request	Back off and retry; see 429
`499`	The client disconnected before the response was written	Usually informational; check client-side timeouts
`500`	Your model code raised an exception, or no replica became available before the parking timeout	Model logs for the traceback; see 500
`502`	Your container crashed or restarted mid-request, or a Baseten-side error	Model logs for a crash or `OOMKilled`; if clean, retry
`503`	Container not available yet: draining or routing	Mostly transient: retry with exponential backoff
`504`	The prediction exceeded the request timeout (1200s sync)	Profile the model, raise resources, or use async

Is it my model or Baseten?

When a request reaches your model and your model server responds with an error, Baseten passes that response through to you. When the request never gets a real answer from your model, Baseten returns its own error instead. Telling these apart is the fastest way to know where to debug. Your model returned the error: the status code and body come from your model server. Your handler raised an exception, returned a non-2xx status, or timed out internally. Debug it like any application bug, starting from the error body and your model logs. Your model was unreachable: the request failed in front of your container, so you get a message like Error making prediction with a 502 or 503. The container is down, restarting, was killed (for example, out of memory), or is still cold-starting. The most common version of the second case is memory pressure. When you increase your payload or batch sizes, each request uses more memory, and the container can be killed mid-request (OOMKilled). Confirm it by checking your model logs for OOMKilled or repeated restarts, and your metrics for memory pressure and replica restarts. To fix it, reduce per-request memory with smaller batches or payloads, or move to a larger instance type.

Errors by status code

Each section below covers what the status code means, its common causes, and what to check first.

400, 401, 403, 404: request and authentication errors

These point to the request itself, not your model.

401: the request has no credentials; the body reads Unauthorized. Send a Baseten API key in the Authorization header.
403: the key is invalid or revoked; the body reads Authentication failed. Check the key’s value and that it belongs to the right workspace.
404: the model or deployment ID doesn’t exist, or the model was deleted. Confirm the ID and that you’re calling the right predict endpoint.
400: the request or URL is malformed. Check the request body is valid JSON and the path is correct.

For Truss CLI authentication errors (such as a missing key in ~/.trussrc), see Troubleshooting inference.

402: payment required

The account has an unresolved billing or payment issue, such as exhausted credits with no payment method on file. Check your billing and usage settings, or contact your account owner.

413: payload too large

The request body exceeded a size limit. Two limits can return a 413:

Edge limit: the ingress proxy caps every inbound request body at 100 MB and rejects larger requests before they reach your model or chain. It covers the full HTTP body, including the JSON envelope and any base64-encoded media, and isn’t configurable.
Async limit: async requests have a smaller per-organization cap, 256 KiB (262,144 bytes) by default. The message reports both sizes: payload size of X bytes exceeds maximum size of Y bytes.

The async payload limit is set per organization. Contact support to raise it for your organization. The 100 MB edge limit is fixed.

For large inputs, send a file or URL the model fetches instead of inlining the bytes in the request body. See Model I/O with files. A payload that fits under the edge limit but exhausts the container’s memory returns a 502, not a 413. See Is it my model or Baseten?.

429: too many requests

The request exceeded a rate limit, or arrived when no capacity was available. Where the limit lives depends on the surface you’re calling:

Model APIs: your workspace exceeded its requests-per-minute or tokens-per-minute limit. Cached input tokens count toward the token limit at full weight. See Rate limits and budgets for the default limits per tier and how to request an increase.
Async endpoints: /async_predict allows 12,000 requests per minute per organization, and the status and cancel endpoints allow 100 requests per second. See Async inference.
Dedicated deployments: the deployment ran out of capacity rather than hitting a rate limit. An async request arrived while an earlier one was already parked waiting for a replica (the body carries a CAPACITY_EXCEEDED error code), or load shedding rejected the request because queued payloads exceeded a memory threshold. See Request queuing and load shedding.

Retry with exponential backoff. A short backoff usually clears a burst; a persistent stream of 429s means your steady-state traffic exceeds the limit, which backoff can’t fix. For capacity 429s on dedicated deployments, increase max replicas or the concurrency target instead.

499: client closed request

A 499 means the client disconnected before the response was written: a user closed a browser tab, a client-side timeout fired, or the calling service cancelled the request. Because the client has already gone away, a 499 shows up in your deployment’s request metrics and logs rather than as a response your code handles. The metrics count the request under status 499, and the logs record Client closed connection with the request ID. When the client disconnects, Baseten cancels the in-flight work and aborts the running request on the inference engine, so you don’t pay for GPU time generating output nobody reads. See Request cancellation. Occasional 499s are normal and need no server-side action. A steady stream of them usually means a client-side timeout shorter than your model’s response time, cancelling requests that would have succeeded. Set the client timeout to match your model’s expected latency. See Configure HTTP clients.

500: internal server error

A 500 has two meanings, distinguished by the error message:

Internal Server Error (in model/chainlet).: your model code raised an exception during the request. The full traceback is in your model logs; debug it like any application bug.
Failed to route request: a sync request was parked waiting for a replica, and none became available before the parking timeout (equal to the 1200-second predict timeout) expired. The most common cause is a cold start that takes longer than the timeout, such as a large model loading weights after scaling from zero. Retry after a brief wait of 30 seconds to a minute; the replica that was starting is often ready by then. If it happens regularly, keep minimum replicas above zero so requests always have somewhere to go, or see Cold starts for ways to make startup faster.

502: bad gateway

A 502 has two meanings, and they need different responses:

Your container was unreachable: the most common case. The container crashed, restarted, or was killed (for example, OOMKilled) mid-request. The body is a short message like Error making prediction. Check your model logs for a crash or restart. See Is it my model or Baseten?.
A Baseten-side error: a transient problem in the gateway or routing layer. If your model logs are clean and show no crash, retry with exponential backoff.

503: service unavailable

The container isn’t available to take the request yet. This is usually transient. Retry with exponential backoff. Common causes:

Draining: an instance is shutting down during a deploy or scale-down. The message asks you to retry on another instance.
Routing: the request couldn’t be routed to a workload plane, or a circuit breaker is open to protect an unhealthy upstream.

A request that arrives while the deployment is scaling up from zero doesn’t return a 503. Baseten holds it at the routing layer until a replica is ready, then forwards it. See Request lifecycle.

Async requests return 503 rather than 502 when the async service isn’t set up on the workload plane yet. See Async inference.

504: gateway timeout

The prediction ran longer than the request timeout (1200 seconds for sync predict). Common causes are a model that’s too slow for the payload, an under-provisioned instance, or a hung request. Profile the model, raise its resources, or move long-running work to the async API, which allows up to 3600 seconds. If you see a timeout sooner than 1200 seconds, check your client’s own timeout. Set it to match your model’s expected response time. See Configure HTTP clients.

Streaming and async errors

Not every failure arrives as a status code with a JSON body. Streaming responses: a streaming response starts with a 200 and an open connection, so a failure partway through (a timeout, or the model becoming unavailable) can’t be sent as a 5xx with an error body. The stream ends early instead. If a stream stops before you receive the end of the response, treat it as a failed request: check your model logs and retry. Async requests: a failure isn’t reported on the submit response. The submit returns once the request is queued, and any error is reported later in the result payload’s errors array, which is empty on success. See Async inference.

Where to look next

When an error isn’t self-explanatory, these are the fastest places to confirm a cause:

Model logs: crashes, restarts, OOMKilled, and your model’s own error output.
Metrics: memory pressure, replica count and restarts, and queue depth.
Request lifecycle: how queuing, cold starts, and concurrency affect request handling.
Async inference: for payloads or runtimes that shouldn’t go through the synchronous path.

If you’re still stuck, contact support with your model or deployment ID, the timestamp, the status code, and the request ID.

Overview

Get started

Model APIs

Inference

Development

Deployment

Engines

Frontier Gateway

Training

Organization

Observability

Troubleshooting

Inference errors

How to read an inference error

Quick reference

Is it my model or Baseten?

Errors by status code

400, 401, 403, 404: request and authentication errors

402: payment required

413: payload too large

429: too many requests

499: client closed request

500: internal server error

502: bad gateway

503: service unavailable

504: gateway timeout

Streaming and async errors

Where to look next

​How to read an inference error

​Quick reference

​Is it my model or Baseten?

​Errors by status code

​400, 401, 403, 404: request and authentication errors

​402: payment required

​413: payload too large

​429: too many requests

​499: client closed request

​500: internal server error

​502: bad gateway

​503: service unavailable

​504: gateway timeout

​Streaming and async errors

​Where to look next

How to read an inference error

Quick reference

Is it my model or Baseten?

Errors by status code

400, 401, 403, 404: request and authentication errors

402: payment required

413: payload too large

429: too many requests

499: client closed request

500: internal server error

502: bad gateway

503: service unavailable

504: gateway timeout

Streaming and async errors

Where to look next