{"error": "<message>"}. This page maps each status code to its likely source and the fastest way to confirm it. Start with the quick reference, then read the section for the status code you received. Streaming and async requests report failures differently, see Streaming and async errors.
How to read an inference error
A failed response has two parts worth reading:- The status code: the broad category, returned in the HTTP response (
502,503, and so on). - The error body:
{"error": "<message>"}. The message is either a short string Baseten generates (such asError making prediction) or, when your model itself returned the error, the raw response your model produced, passed through unchanged.
Quick reference
| Status | What it usually means | Where to look |
|---|---|---|
400 / 401 / 403 / 404 | Malformed request, invalid API key, or wrong model ID | Your request: check the endpoint and API key |
402 | Billing or payment issue on the account | Billing and usage |
413 | Request body too large: 100 MB edge cap (all requests), or 256 KiB async default | Send large inputs by file or URL, not inline |
502 | Your container crashed or restarted mid-request, or a Baseten-side error | Model logs for a crash or OOMKilled; if clean, retry |
503 | Container not available yet: draining or routing | Mostly transient: retry with exponential backoff |
504 | The prediction exceeded the request timeout (1200s sync) | Profile the model, raise resources, or use async |
Is it my model or Baseten?
When a request reaches your model and your model server responds with an error, Baseten passes that response through to you. When the request never gets a real answer from your model, Baseten returns its own error instead. Telling these apart is the fastest way to know where to debug. Your model returned the error: the status code and body come from your model server. Your handler raised an exception, returned a non-2xx status, or timed out internally. Debug it like any application bug, starting from the error body and your model logs. Your model was unreachable: the request failed in front of your container, so you get a message likeError making prediction with a 502 or 503. The container is down, restarting, was killed (for example, out of memory), or is still cold-starting.
The most common version of the second case is memory pressure. When you increase your payload or batch sizes, each request uses more memory, and the container can be killed mid-request (OOMKilled). Confirm it by checking your model logs for OOMKilled or repeated restarts, and your metrics for memory pressure and replica restarts. To fix it, reduce per-request memory with smaller batches or payloads, or move to a larger instance type.
Errors by status code
Each section below covers what the status code means, its common causes, and what to check first.400, 401, 403, 404: request and authentication errors
These point to the request itself, not your model.401/403: the API key is missing or invalid. Use a valid Baseten API key in theAuthorizationheader.404: the model or deployment ID doesn’t exist, or the model was deleted. Confirm the ID and that you’re calling the right predict endpoint.400: the request or URL is malformed. Check the request body is valid JSON and the path is correct.
~/.trussrc), see Troubleshooting inference.
402: payment required
The account has an unresolved billing or payment issue, such as exhausted credits with no payment method on file. Check your billing and usage settings, or contact your account owner.413: payload too large
The request body exceeded a size limit. Two limits can return a413:
- Edge limit: the ingress proxy caps every inbound request body at 100 MB and rejects larger requests before they reach your model or chain. It covers the full HTTP body, including the JSON envelope and any base64-encoded media, and isn’t configurable.
- Async limit: async requests have a smaller per-organization cap, 256 KiB (262,144 bytes) by default. The message reports both sizes:
payload size of X bytes exceeds maximum size of Y bytes.
The async payload limit is set per organization. Contact support to raise it for your organization. The 100 MB edge limit is fixed.
502, not a 413. See Is it my model or Baseten?.
502: bad gateway
A502 has two meanings, and they need different responses:
- Your container was unreachable: the most common case. The container crashed, restarted, or was killed (for example,
OOMKilled) mid-request. The body is a short message likeError making prediction. Check your model logs for a crash or restart. See Is it my model or Baseten?. - A Baseten-side error: a transient problem in the gateway or routing layer. If your model logs are clean and show no crash, retry with exponential backoff.
502 reported as client closed connection. That means the client disconnected before the response finished, not a Baseten or model failure.
503: service unavailable
The container isn’t available to take the request yet. This is usually transient. Retry with exponential backoff. Common causes:- Draining: an instance is shutting down during a deploy or scale-down. The message asks you to retry on another instance.
- Routing: the request couldn’t be routed to a workload plane, or a circuit breaker is open to protect an unhealthy upstream.
503. Baseten holds it at the routing layer until a replica is ready, then forwards it. See Request lifecycle.
Async requests return
503 rather than 502 when the async service isn’t set up on the workload plane yet. See Async inference.504: gateway timeout
The prediction ran longer than the request timeout (1200 seconds for sync predict). Common causes are a model that’s too slow for the payload, an under-provisioned instance, or a hung request. Profile the model, raise its resources, or move long-running work to the async API, which allows up to 3600 seconds. If you see a timeout sooner than 1200 seconds, check your client’s own timeout. Set it to match your model’s expected response time. See Configure HTTP clients.Streaming and async errors
Not every failure arrives as a status code with a JSON body. Streaming responses: a streaming response starts with a200 and an open connection, so a failure partway through (a timeout, or the model becoming unavailable) can’t be sent as a 5xx with an error body. The stream ends early instead. If a stream stops before you receive the end of the response, treat it as a failed request: check your model logs and retry.
Async requests: a failure isn’t reported on the submit response. The submit returns once the request is queued, and any error is reported later in the result payload’s errors array, which is empty on success. See Async inference.
Where to look next
When an error isn’t self-explanatory, these are the fastest places to confirm a cause:- Model logs: crashes, restarts,
OOMKilled, and your model’s own error output. - Metrics: memory pressure, replica count and restarts, and queue depth.
- Request lifecycle: how queuing, cold starts, and concurrency affect request handling.
- Async inference: for payloads or runtimes that shouldn’t go through the synchronous path.