Skip to main content
When an inference request fails, the error message alone often isn’t enough to tell you what to fix. The status code is the first clue. It tells you the broad category, and because your request passes through Baseten’s inference gateway before it reaches your model, it also points at where the failure happened: your request, the Baseten platform, or your model’s own code. Every failed response is JSON, in the form {"error": "<message>"}. This page maps each status code to its likely source and the fastest way to confirm it. Start with the quick reference, then read the section for the status code you received. Streaming and async requests report failures differently, see Streaming and async errors.

How to read an inference error

A failed response has two parts worth reading:
  • The status code: the broad category, returned in the HTTP response (502, 503, and so on).
  • The error body: {"error": "<message>"}. The message is either a short string Baseten generates (such as Error making prediction) or, when your model itself returned the error, the raw response your model produced, passed through unchanged.
One distinction resolves most failures: did your model return the error, or was your model unreachable? See Is it my model or Baseten?.

Quick reference

StatusWhat it usually meansWhere to look
400 / 401 / 403 / 404Malformed request, invalid API key, or wrong model IDYour request: check the endpoint and API key
402Billing or payment issue on the accountBilling and usage
413Request body too large: 100 MB edge cap (all requests), or 256 KiB async defaultSend large inputs by file or URL, not inline
502Your container crashed or restarted mid-request, or a Baseten-side errorModel logs for a crash or OOMKilled; if clean, retry
503Container not available yet: draining or routingMostly transient: retry with exponential backoff
504The prediction exceeded the request timeout (1200s sync)Profile the model, raise resources, or use async

Is it my model or Baseten?

When a request reaches your model and your model server responds with an error, Baseten passes that response through to you. When the request never gets a real answer from your model, Baseten returns its own error instead. Telling these apart is the fastest way to know where to debug. Your model returned the error: the status code and body come from your model server. Your handler raised an exception, returned a non-2xx status, or timed out internally. Debug it like any application bug, starting from the error body and your model logs. Your model was unreachable: the request failed in front of your container, so you get a message like Error making prediction with a 502 or 503. The container is down, restarting, was killed (for example, out of memory), or is still cold-starting. The most common version of the second case is memory pressure. When you increase your payload or batch sizes, each request uses more memory, and the container can be killed mid-request (OOMKilled). Confirm it by checking your model logs for OOMKilled or repeated restarts, and your metrics for memory pressure and replica restarts. To fix it, reduce per-request memory with smaller batches or payloads, or move to a larger instance type.

Errors by status code

Each section below covers what the status code means, its common causes, and what to check first.

400, 401, 403, 404: request and authentication errors

These point to the request itself, not your model.
  • 401 / 403: the API key is missing or invalid. Use a valid Baseten API key in the Authorization header.
  • 404: the model or deployment ID doesn’t exist, or the model was deleted. Confirm the ID and that you’re calling the right predict endpoint.
  • 400: the request or URL is malformed. Check the request body is valid JSON and the path is correct.
For Truss CLI authentication errors (such as a missing key in ~/.trussrc), see Troubleshooting inference.

402: payment required

The account has an unresolved billing or payment issue, such as exhausted credits with no payment method on file. Check your billing and usage settings, or contact your account owner.

413: payload too large

The request body exceeded a size limit. Two limits can return a 413:
  • Edge limit: the ingress proxy caps every inbound request body at 100 MB and rejects larger requests before they reach your model or chain. It covers the full HTTP body, including the JSON envelope and any base64-encoded media, and isn’t configurable.
  • Async limit: async requests have a smaller per-organization cap, 256 KiB (262,144 bytes) by default. The message reports both sizes: payload size of X bytes exceeds maximum size of Y bytes.
The async payload limit is set per organization. Contact support to raise it for your organization. The 100 MB edge limit is fixed.
For large inputs, send a file or URL the model fetches instead of inlining the bytes in the request body. See Model I/O with files. A payload that fits under the edge limit but exhausts the container’s memory returns a 502, not a 413. See Is it my model or Baseten?.

502: bad gateway

A 502 has two meanings, and they need different responses:
  • Your container was unreachable: the most common case. The container crashed, restarted, or was killed (for example, OOMKilled) mid-request. The body is a short message like Error making prediction. Check your model logs for a crash or restart. See Is it my model or Baseten?.
  • A Baseten-side error: a transient problem in the gateway or routing layer. If your model logs are clean and show no crash, retry with exponential backoff.
You might also see a 502 reported as client closed connection. That means the client disconnected before the response finished, not a Baseten or model failure.

503: service unavailable

The container isn’t available to take the request yet. This is usually transient. Retry with exponential backoff. Common causes:
  • Draining: an instance is shutting down during a deploy or scale-down. The message asks you to retry on another instance.
  • Routing: the request couldn’t be routed to a workload plane, or a circuit breaker is open to protect an unhealthy upstream.
A request that arrives while the deployment is scaling up from zero doesn’t return a 503. Baseten holds it at the routing layer until a replica is ready, then forwards it. See Request lifecycle.
Async requests return 503 rather than 502 when the async service isn’t set up on the workload plane yet. See Async inference.

504: gateway timeout

The prediction ran longer than the request timeout (1200 seconds for sync predict). Common causes are a model that’s too slow for the payload, an under-provisioned instance, or a hung request. Profile the model, raise its resources, or move long-running work to the async API, which allows up to 3600 seconds. If you see a timeout sooner than 1200 seconds, check your client’s own timeout. Set it to match your model’s expected response time. See Configure HTTP clients.

Streaming and async errors

Not every failure arrives as a status code with a JSON body. Streaming responses: a streaming response starts with a 200 and an open connection, so a failure partway through (a timeout, or the model becoming unavailable) can’t be sent as a 5xx with an error body. The stream ends early instead. If a stream stops before you receive the end of the response, treat it as a failed request: check your model logs and retry. Async requests: a failure isn’t reported on the submit response. The submit returns once the request is queued, and any error is reported later in the result payload’s errors array, which is empty on success. See Async inference.

Where to look next

When an error isn’t self-explanatory, these are the fastest places to confirm a cause:
  • Model logs: crashes, restarts, OOMKilled, and your model’s own error output.
  • Metrics: memory pressure, replica count and restarts, and queue depth.
  • Request lifecycle: how queuing, cold starts, and concurrency affect request handling.
  • Async inference: for payloads or runtimes that shouldn’t go through the synchronous path.
If you’re still stuck, contact support with your model or deployment ID, the timestamp, the status code, and the request ID.