Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests with priority 1). priority must be between 0 and 2, inclusive.
Two types of rate limits apply when making async requests:
Calls to the /async_predict endpoint are limited to 200 requests per second.
Each organization is limited to 5,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.
If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code.
To avoid hitting these rate limits, we advise:
Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors.
Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider letting it process more requests at once by raising the maximum number of replicas or the concurrency target in your autoscaling settings.
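The backoff advice above can be sketched roughly as follows. This is a minimal illustration, not an official client: `call_with_backoff`, its parameters, and the default delays are all hypothetical, and `send` stands for any callable you supply that issues the POST to /async_predict and returns a `(status_code, body)` pair. The sketch uses exponential backoff with random jitter, which is one common way to spread out retries.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 6,       # assumed default, tune for your workload
    base_delay: float = 0.5,     # assumed default, in seconds
    max_delay: float = 30.0,     # assumed upper bound on a single wait
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        # Return on any non-429 response, or on the final attempt regardless.
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        # The wait ceiling doubles on each retry, capped at max_delay.
        # Sleeping a random time below the ceiling keeps many clients
        # from retrying in lockstep.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

In practice `send` might wrap an HTTP client call to your deployment's /async_predict URL. If the last attempt still returns 429, the caller receives that response and can decide whether to drop the request, queue it locally, or raise an alert.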