Production deployment
Use this endpoint to call the production deployment of your model asynchronously.
Parameters
The ID of the model you want to call.
Headers
Your Baseten API key, formatted with the prefix Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).
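As a rough illustration of the header format, the following Python sketch builds (but does not send) a POST request using only the standard library. The function name build_async_request, the placeholder URL, and the Content-Type header are assumptions made for the example, not part of the documented API; substitute the actual /async_predict URL of your deployment.

```python
import json
import urllib.request


def build_async_request(url, api_key, body):
    """Build a POST request carrying the Api-Key authorization header."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The key is sent with the literal prefix "Api-Key ".
            "Authorization": f"Api-Key {api_key}",
            # Assumption: the body is sent as JSON.
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and key, for illustration only.
request = build_async_request(
    "https://example.invalid/async_predict",
    "abcd1234.abcd1234",
    {},
)
```

Passing the resulting object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without network access.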
Body
There is a 256 KiB size limit on /async_predict request payloads.
JSON-serializable model input.
URL of the webhook endpoint. Webhook endpoints must use HTTPS. If webhook_endpoint is empty, your model must save prediction outputs so they can be accessed later.
Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests of priority 1).
priority must be between 0 and 2, inclusive.
Maximum time a request will spend in the queue before expiring.
max_time_in_queue_seconds must be between 10 seconds and 12 hours, inclusive.
Exponential backoff parameters used to retry the model predict request.
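The body constraints above can be checked on the client before a request is sent. The sketch below is one way to do that, under stated assumptions: build_async_body is a hypothetical helper, and the model_input key is an assumed field name for the JSON-serializable model input. The webhook_endpoint, priority, and max_time_in_queue_seconds names and their limits are taken from the descriptions above. The retry parameters are omitted because their field names are not given here.

```python
import json
from urllib.parse import urlparse

MAX_PAYLOAD_BYTES = 256 * 1024             # 256 KiB payload limit
PRIORITY_RANGE = (0, 2)                    # inclusive
QUEUE_SECONDS_RANGE = (10, 12 * 60 * 60)   # 10 seconds to 12 hours, inclusive


def build_async_body(model_input, webhook_endpoint=None, priority=None,
                     max_time_in_queue_seconds=None):
    """Return the encoded JSON body, raising ValueError if a limit is violated."""
    # Assumption: the model input is sent under the key "model_input".
    body = {"model_input": model_input}

    if webhook_endpoint is not None:
        if urlparse(webhook_endpoint).scheme != "https":
            raise ValueError("webhook_endpoint must use HTTPS")
        body["webhook_endpoint"] = webhook_endpoint

    if priority is not None:
        low, high = PRIORITY_RANGE
        if not low <= priority <= high:
            raise ValueError("priority must be between 0 and 2, inclusive")
        body["priority"] = priority

    if max_time_in_queue_seconds is not None:
        low, high = QUEUE_SECONDS_RANGE
        if not low <= max_time_in_queue_seconds <= high:
            raise ValueError(
                "max_time_in_queue_seconds must be between 10 and 43200"
            )
        body["max_time_in_queue_seconds"] = max_time_in_queue_seconds

    encoded = json.dumps(body).encode("utf-8")
    if len(encoded) > MAX_PAYLOAD_BYTES:
        raise ValueError("request payload exceeds the 256 KiB limit")
    return encoded
```

Validating locally does not replace server-side checks; it only avoids spending a request on a payload that would be rejected.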
Response
The ID of the async request.
Rate limits
Two types of rate limits apply when making async requests:
- Calls to the /async_predict endpoint are limited to 200 requests per second.
- Each organization is limited to 50,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.
If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code.
To avoid hitting these rate limits, we advise:
- Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors.
- Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
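The backpressure advice above could be implemented along the following lines. This is a minimal sketch, not a prescribed client: call_with_backoff is a hypothetical helper, send is a caller-supplied function that performs one /async_predict call and returns a (status_code, body) pair, and the attempt count, delay values, and use of random jitter are illustrative choices rather than documented recommendations.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    Returns the last (status_code, body) pair received.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Double the delay ceiling on each retry, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from concurrent clients.
            sleep(random.uniform(0, ceiling))
    return status, body
```

The sleep argument is injectable so the retry loop can be exercised in tests without waiting. If the final attempt still returns 429, the caller receives that status and can decide whether to drop the request or queue it locally.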