Parameters

model_id
string
required

The ID of the model you want to call.

Headers

Authorization
string
required

Your Baseten API key, prefixed with Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).

Body

There is a 256 KiB size limit on /async_predict request payloads.

model_input
json
required

JSON-serializable model input.

webhook_endpoint
string
default: "null"

URL of the webhook endpoint. We require that webhook endpoints use HTTPS.

Baseten does not store model outputs. If webhook_endpoint is empty, your model must save prediction outputs so they can be accessed later.

priority
integer
default: 0

Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).

priority must be between 0 and 2, inclusive.

max_time_in_queue_seconds
integer
default: 600

Maximum time a request will spend in the queue before expiring.

max_time_in_queue_seconds must be between 10 seconds and 72 hours, inclusive.

inference_retry_config
json

Exponential backoff parameters used to retry the model predict request.
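
The constraints on the body fields above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: the helper names build_async_predict_body and build_headers are hypothetical, the Content-Type header is an assumption, and inference_retry_config is left out because its fields are not listed here. It builds the header and JSON body only; sending them to the /async_predict endpoint of your model is left to the HTTP library of your choice.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024      # 256 KiB payload limit stated above
MIN_QUEUE_SECONDS = 10              # lower bound for max_time_in_queue_seconds
MAX_QUEUE_SECONDS = 72 * 60 * 60    # 72 hours, upper bound


def build_headers(api_key):
    """Return request headers (hypothetical helper; Content-Type is assumed)."""
    return {
        "Authorization": f"Api-Key {api_key}",
        "Content-Type": "application/json",
    }


def build_async_predict_body(model_input, webhook_endpoint=None, priority=0,
                             max_time_in_queue_seconds=600):
    """Validate the documented constraints and return the encoded JSON body."""
    if not 0 <= priority <= 2:
        raise ValueError("priority must be between 0 and 2, inclusive")
    if not MIN_QUEUE_SECONDS <= max_time_in_queue_seconds <= MAX_QUEUE_SECONDS:
        raise ValueError("max_time_in_queue_seconds must be between 10 seconds and 72 hours")
    if webhook_endpoint is not None and not webhook_endpoint.startswith("https://"):
        raise ValueError("webhook_endpoint must use HTTPS")

    body = {
        "model_input": model_input,
        "priority": priority,
        "max_time_in_queue_seconds": max_time_in_queue_seconds,
    }
    # Only include the webhook when one is configured.
    if webhook_endpoint is not None:
        body["webhook_endpoint"] = webhook_endpoint

    encoded = json.dumps(body).encode("utf-8")
    if len(encoded) > MAX_PAYLOAD_BYTES:
        raise ValueError("request payload exceeds the 256 KiB limit")
    return encoded
```

Checking these limits locally avoids spending a request (and part of the rate limit) on a payload that would be rejected anyway.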

Response

request_id
string
required

The ID of the async request.

Rate limits

Two types of rate limits apply when making async requests:

  • Calls to the /async_predict endpoint are limited to 200 requests per second.

  • Each organization is limited to 50,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.

If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code.

To avoid hitting these rate limits, we advise:

  • Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors.
  • Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider letting it process more requests at once by raising the maximum number of replicas or the concurrency target in your autoscaling settings.
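
The backpressure advice above can be sketched as a retry loop with exponential backoff. This is one possible approach under stated assumptions, not a prescribed implementation: call_with_backoff and its parameters are hypothetical, send stands for any zero-argument callable that performs the /async_predict request and returns a (status_code, body) pair, and the randomized ("full jitter") delay is an added design choice that keeps many clients from retrying in lockstep.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair received.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Double the delay ceiling on each retry, capped at max_delay,
            # then sleep for a random duration up to that ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    return status, body
```

A caller would wrap its own request function, for example call_with_backoff(lambda: post_async_predict(payload)), where post_async_predict is whatever function issues the HTTP request in your code. If the final status is still 429, the request was not accepted and should be queued or dropped on the client side.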