Async development

Parameters

model_id

string

required

The ID of the model you want to call.

Headers

Authorization

string

required

Your Baseten API key, formatted with prefix Api-Key (e.g. {"Authorization": "Api-Key abcd1234.abcd1234"}).

Body

There is a 256 KiB size limit to /async_predict request payloads.
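A client-side check can catch oversized payloads before they are sent. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the limit applies to the UTF-8 encoded JSON body, which the text above does not state explicitly.

```python
import json

# 256 KiB, the documented limit for /async_predict request payloads.
MAX_PAYLOAD_BYTES = 256 * 1024


def payload_size_bytes(body: dict) -> int:
    """Return the size in bytes of the JSON-encoded body (UTF-8)."""
    return len(json.dumps(body).encode("utf-8"))


def fits_async_limit(body: dict) -> bool:
    """True if the encoded body is within the documented limit.

    Assumption: the limit is measured on the serialized request body.
    """
    return payload_size_bytes(body) <= MAX_PAYLOAD_BYTES
```

If a payload fails this check, one option is to pass a reference to the data (for example a URL) in model_input instead of the data itself.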

model_input

json

required

JSON-serializable model input.

webhook_endpoint

string

default:"null"

URL of the webhook endpoint. We require that webhook endpoints use HTTPS. Both HTTP/2 and HTTP/1.1 protocols are supported.

Baseten does not store model outputs. If webhook_endpoint is empty, your model must save prediction outputs so they can be accessed later.

priority

integer

default:0

Priority of the request. A lower value corresponds to a higher priority (e.g. requests with priority 0 are scheduled before requests with priority 1). priority must be between 0 and 2, inclusive.

max_time_in_queue_seconds

integer

default:600

Maximum time a request will spend in the queue before expiring. max_time_in_queue_seconds must be between 10 seconds and 72 hours, inclusive.

inference_retry_config

json

Exponential backoff parameters used to retry the model predict request.

max_attempts

integer

default:3

Number of predict request attempts. max_attempts must be between 1 and 10, inclusive.

initial_delay_ms

integer

default:1000

Minimum time between retries in milliseconds. initial_delay_ms must be between 0 and 10,000 milliseconds, inclusive.

max_delay_ms

integer

default:5000

Maximum time between retries in milliseconds. max_delay_ms must be between 0 and 60,000 milliseconds, inclusive.
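The body fields and their documented ranges can be collected into one request body. The following is a minimal sketch, not an official client: build_async_body and LIMITS are hypothetical names, and the defaults and inclusive ranges are copied from the field descriptions above.

```python
from typing import Any, Dict, Optional

# Inclusive ranges taken from the field descriptions above.
LIMITS = {
    "priority": (0, 2),
    "max_time_in_queue_seconds": (10, 72 * 60 * 60),  # 10 seconds to 72 hours
    "max_attempts": (1, 10),
    "initial_delay_ms": (0, 10_000),
    "max_delay_ms": (0, 60_000),
}


def _check(name: str, value: int) -> int:
    """Return value if it lies in the documented range, else raise."""
    low, high = LIMITS[name]
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def build_async_body(
    model_input: Any,
    webhook_endpoint: Optional[str] = None,
    priority: int = 0,
    max_time_in_queue_seconds: int = 600,
    max_attempts: int = 3,
    initial_delay_ms: int = 1000,
    max_delay_ms: int = 5000,
) -> Dict[str, Any]:
    """Assemble an /async_predict body, validating documented ranges."""
    if webhook_endpoint is not None and not webhook_endpoint.startswith("https://"):
        raise ValueError("webhook_endpoint must use HTTPS")
    body: Dict[str, Any] = {
        "model_input": model_input,
        "priority": _check("priority", priority),
        "max_time_in_queue_seconds": _check(
            "max_time_in_queue_seconds", max_time_in_queue_seconds
        ),
        "inference_retry_config": {
            "max_attempts": _check("max_attempts", max_attempts),
            "initial_delay_ms": _check("initial_delay_ms", initial_delay_ms),
            "max_delay_ms": _check("max_delay_ms", max_delay_ms),
        },
    }
    # Only include the webhook when one was provided.
    if webhook_endpoint is not None:
        body["webhook_endpoint"] = webhook_endpoint
    return body
```

The returned dictionary can be passed as the json argument of requests.post. Validating locally is a design choice that surfaces out-of-range values before a network round trip; the server remains the authority on what it accepts.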

Response

request_id

string

required

The ID of the async request.

Rate limits

Two types of rate limits apply when making async requests:

Calls to the /async_predict endpoint are limited to 200 requests per second.
Each organization is limited to 50,000 QUEUED or IN_PROGRESS async requests, summed across all deployments.

If either limit is exceeded, subsequent /async_predict requests will receive a 429 status code. To avoid hitting these rate limits, we advise:

Implementing a backpressure mechanism, such as calling /async_predict with exponential backoff in response to 429 errors.
Monitoring the async queue size metric. If your model is accumulating a backlog of requests, consider increasing the number of requests your model can process at once by increasing the number of max replicas or the concurrency target in your autoscaling settings.
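The backpressure advice above could be sketched as a small retry helper. This is one possible approach, not a prescribed implementation: post_with_backoff is a hypothetical name, the delay values are arbitrary, and the doubling schedule with random jitter is an assumption about what works well rather than something the API requires.

```python
import random
import time
from typing import Any, Callable


def post_with_backoff(
    send: Callable[[], Any],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-429 response or attempts run out.

    send must return an object with a status_code attribute, such as a
    requests.Response. The last response is returned even if it is a 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        resp = send()
        if resp.status_code != 429 or attempt == max_attempts:
            return resp
        # Sleep a random amount up to the current delay, then double it.
        sleep(random.uniform(0, delay))
        delay = min(delay * 2, max_delay_s)
```

With the requests library, send could be a lambda wrapping the call, such as lambda: requests.post(url, headers=headers, json=body). Injecting sleep as a parameter keeps the helper easy to test without real waiting.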

import os

import requests

model_id = ""
webhook_endpoint = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Optional fields for priority, max_time_in_queue_seconds, etc.
    },
)

print(resp.json())

{
  "request_id": "<string>"
}
