curl --request POST \
  --url https://model-{model_id}.api.baseten.co/environments/{env_name}/async_predict \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model_input": "<any>",
  "webhook_endpoint": "<string>",
  "priority": 123,
  "max_time_in_queue_seconds": 123,
  "inference_retry_config": "<any>"
}'
Example response:

{
  "request_id": "<string>"
}
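
For reference, here is the same request as a minimal Python sketch, assuming the requests library and placeholder values for the model ID, environment name, and API key:

import requests

MODEL_ID = "abcd1234"           # placeholder: your model ID
ENV_NAME = "production"         # placeholder: your environment name
API_KEY = "abcd1234.abcd1234"   # placeholder: your Baseten API key

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/environments/{ENV_NAME}/async_predict",
    headers={
        "Authorization": f"Api-Key {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model_input": {"prompt": "Hello!"},   # any JSON-serializable input
        "webhook_endpoint": "https://example.com/webhook",
        "priority": 0,
        "max_time_in_queue_seconds": 600,
    },
)
resp.raise_for_status()
print(resp.json()["request_id"])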

Parameters

model_id
string
required

The ID of the model you want to call.

env_name
string
required

The name of the model’s environment you want to call.

Headers

Authorization
string
required

Your Baseten API key, formatted with the prefix Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).

Body

There is a 256 KiB size limit on /async_predict request payloads.

model_input
json
required

JSON-serializable model input.

webhook_endpoint
string
default:"null"
Baseten does not store model outputs. If webhook_endpoint is empty, your model must save prediction outputs so they can be accessed later.

URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.
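
If you do supply a webhook_endpoint, your service must expose an HTTPS URL that accepts POST requests. Below is a minimal Flask sketch of such a receiver; the payload field names are illustrative assumptions, not the documented webhook schema:

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_async_result():
    payload = request.get_json()
    # Illustrative only: check the async webhook docs for the actual
    # payload schema delivered to your endpoint.
    request_id = payload.get("request_id")
    # ... persist the prediction output keyed by request_id ...
    return "", 200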

priority
integer
default:0

Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests with priority 1).

priority must be between 0 and 2, inclusive.

max_time_in_queue_seconds
integer
default:600

Maximum time a request will spend in the queue before expiring.

max_time_in_queue_seconds must be between 10 seconds and 72 hours (259,200 seconds), inclusive.

inference_retry_config
json

Exponential backoff parameters used to retry the model predict request.
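
The exact schema of inference_retry_config is not specified on this page. As a sketch only, a request body might attach it as below; every field name inside retry_config is a hypothetical placeholder to check against the retry configuration docs, not a confirmed part of the API:

# Hypothetical sketch: the field names inside retry_config are
# assumptions, not documented API fields.
retry_config = {
    "max_attempts": 3,         # assumed: total predict attempts
    "initial_delay_ms": 1000,  # assumed: first backoff delay
    "max_delay_ms": 10000,     # assumed: ceiling for backoff delays
}

body = {
    "model_input": {"prompt": "Hello!"},
    "inference_retry_config": retry_config,
}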

Response

request_id
string
required

The ID of the async request.

Rate limits

Two types of rate limits apply when making async requests:

  • Calls to the /async_predict endpoint are limited to 200 requests per second.

  • Each organization is limited to 50,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.

If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code.

To avoid hitting these rate limits, we advise:

  • Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors (see the sketch after this list).
  • Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider increasing how many requests it can process at once by raising the max replica count or the concurrency target in your autoscaling settings.
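
As a sketch of the backoff approach from the first bullet, assuming the Python requests library and the endpoint URL, headers, and body shown above:

import time

import requests

def submit_with_backoff(url, headers, body, max_retries=5):
    """POST to /async_predict, doubling the wait after each 429."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=body)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-rate-limit errors
            return resp.json()       # contains "request_id"
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("/async_predict still rate-limited after retries")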