Parameters

model_id
string
required

The ID of the model you want to call.

Environment Name
string
required

The name of the model’s environment you want to call.

Headers

Authorization
string
required

Your Baseten API key, prefixed with Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).
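As a minimal sketch of the header format described above, the helper below builds the Authorization header from an API key. The function name auth_headers is hypothetical and only serves as an illustration.

```python
def auth_headers(api_key: str) -> dict:
    """Return the Authorization header in the "Api-Key <key>" format.

    auth_headers is a hypothetical helper name, not part of any SDK.
    """
    return {"Authorization": f"Api-Key {api_key}"}
```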

Body

There is a 256 KiB size limit on /async_predict request payloads.

model_input
json
required

JSON-serializable model input.

webhook_endpoint
string
default:
"null"

URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.

Baseten does not store model outputs. If webhook_endpoint is empty, your model must save prediction outputs so they can be accessed later.

priority
integer
default:
0

Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).

priority must be between 0 and 2, inclusive.

max_time_in_queue_seconds
integer
default:
600

Maximum time a request will spend in the queue before expiring.

max_time_in_queue_seconds must be between 10 seconds and 72 hours, inclusive.

inference_retry_config
json

Exponential backoff parameters used to retry the model predict request.
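The body constraints above (priority range, queue-time range, HTTPS webhook, 256 KiB payload limit) can be checked on the client before sending. The sketch below is one possible way to do that. The function name build_async_predict_body is hypothetical, inference_retry_config is left out because its fields are not listed here, and the size check assumes the limit applies to the UTF-8 encoded JSON body.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024          # 256 KiB limit on request payloads
MAX_QUEUE_SECONDS = 72 * 60 * 60        # 72 hours, expressed in seconds


def build_async_predict_body(model_input, webhook_endpoint=None, priority=0,
                             max_time_in_queue_seconds=600):
    """Validate the documented constraints and return the encoded JSON body.

    Hypothetical helper: raises ValueError when a documented limit is violated.
    """
    if not 0 <= priority <= 2:
        raise ValueError("priority must be between 0 and 2, inclusive")
    if not 10 <= max_time_in_queue_seconds <= MAX_QUEUE_SECONDS:
        raise ValueError("max_time_in_queue_seconds must be between 10 and 259200")
    if webhook_endpoint is not None and not webhook_endpoint.startswith("https://"):
        raise ValueError("webhook_endpoint must use HTTPS")

    body = {
        "model_input": model_input,
        "priority": priority,
        "max_time_in_queue_seconds": max_time_in_queue_seconds,
    }
    # Only include the webhook when one is provided.
    if webhook_endpoint is not None:
        body["webhook_endpoint"] = webhook_endpoint

    encoded = json.dumps(body).encode("utf-8")
    # Assumption: the limit is measured on the serialized request body.
    if len(encoded) > MAX_PAYLOAD_BYTES:
        raise ValueError("request payload exceeds 256 KiB")
    return encoded
```

The returned bytes can be passed as the request body to any HTTP client, together with the Authorization header and a Content-Type of application/json.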

Response

request_id
string
required

The ID of the async request.

Rate limits

Two types of rate limits apply when making async requests:

  • Calls to the /async_predict endpoint are limited to 200 requests per second.

  • Each organization is limited to 50,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.

If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code.

To avoid hitting these rate limits, we advise:

  • Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors.
  • Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
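The backpressure advice above can be sketched as a retry loop with exponential backoff and jitter. This is an illustration under stated assumptions, not a prescribed client: call_with_backoff and its parameters are hypothetical, and send stands for any zero-argument callable you supply that performs the /async_predict request and returns a (status_code, body) tuple.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a caller-supplied callable returning (status_code, body).
    The delay doubles after each 429, capped at max_delay, plus random jitter.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from clients that were limited together.
            sleep(delay + random.uniform(0, delay / 2))
    return status, body
```

If every attempt returns 429, the last response is handed back to the caller, which can then decide whether to drop the request or requeue it locally.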