Async requests are a “fire and forget” way of executing model inference requests. Instead of waiting for a response from the model, an async request queues the work and immediately returns a request identifier. Optionally, async request results are sent via a POST request to a user-defined webhook upon completion.
Use async requests for:
Long-running inference tasks that may otherwise hit request timeouts.
Batched inference jobs.
Prioritizing certain inference requests.
Async fast facts:
Async requests can be made to any model—no model code changes necessary.
Async requests can remain queued for up to 72 hours and run for up to 1 hour.
Async requests are not compatible with streaming model output.
Async request inputs and model outputs are not stored after an async request has been completed. Instead, model outputs will be sent to your webhook via a POST request.
There are two ways to handle async request outputs:
Provide a webhook endpoint where model outputs will be sent via a POST request. With a webhook, you can use async inference on any model without making any changes to your model code.
Inside your Truss’s model.py, save prediction results to cloud storage. If a webhook endpoint is also provided, your model outputs will additionally be sent to your webhook.
Note that Baseten does not store model outputs. If you do not wish to use a webhook, your model.py must write model outputs to a cloud storage bucket or database as part of its implementation.
1. Set up a webhook endpoint
Set up a webhook endpoint for handling completed async requests. Since Baseten doesn’t store model outputs, model outputs from async requests will be sent to your webhook endpoint.
Before creating your first async request, try running a sample request against your webhook endpoint to ensure that it can consume async predict results properly. Check out this example webhook test.
2. Call /async_predict on your model. The body of an /async_predict request includes the model input in the model_input field, plus the webhook endpoint (from the previous step) in the webhook_endpoint field.
```python
import requests
import os

model_id = ""  # Replace this with your model ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the production deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Optional fields for priority, max_time_in_queue_seconds, etc.
    },
)

print(resp.json())
```
3. Save the request_id from the /async_predict response to check its status or cancel it.
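If you want to poll instead of waiting for the webhook, you can query the request’s status with the saved request_id. Below is a rough sketch; the status endpoint path shown is an assumption for illustration, so confirm the exact endpoints in the async inference API reference.

```python
import os
import requests

model_id = ""    # Replace this with your model ID
request_id = ""  # The request_id returned by /async_predict

baseten_api_key = os.environ["BASETEN_API_KEY"]

# NOTE: this status endpoint path is an assumption for illustration;
# see the async inference API reference for the exact status and
# cancellation endpoints.
resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
)
print(resp.json())
```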
If you’d rather not rely on a webhook, save outputs from your model code instead:
1. Save prediction results to cloud storage
Update your Truss’s model.py to save prediction results to cloud storage, such as S3 or GCS. We recommend implementing this in your model’s postprocess() method, which runs on CPU after the prediction has completed.
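As a rough illustration, a postprocess() implementation along these lines could upload results to S3. This is a minimal sketch assuming boto3 is available and using a hypothetical bucket name; adapt the storage client and credential handling to your setup.

```python
import json
import uuid

import boto3  # Assumes boto3 is listed in your Truss requirements


class Model:
    def __init__(self, **kwargs):
        self._s3 = None

    def load(self):
        # Credentials come from your environment or Baseten secrets
        self._s3 = boto3.client("s3")

    def predict(self, model_input):
        # Your model inference here
        return {"output": "..."}

    def postprocess(self, model_output):
        # Runs on CPU after predict() completes. Persist the output here,
        # since Baseten does not store async results.
        self._s3.put_object(
            Bucket="my-results-bucket",  # Hypothetical bucket name
            Key=f"predictions/{uuid.uuid4()}.json",
            Body=json.dumps(model_output),
        )
        return model_output
```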
2. Set up a webhook endpoint (optional)
Optionally, set up a webhook endpoint so Baseten can notify you when your async request completes.
Before creating your first async request, try running a sample request against your webhook endpoint to ensure that it can consume async predict results properly. Check out this example webhook test.
3. Call /async_predict on your model as described in the previous section, and save the request_id from the response to check its status or cancel it.
Chains: this guide is written for Truss models, but Chains likewise support async inference. A Chain entrypoint can be invoked via its async_run_remote endpoint, e.g. https://chain-{chain_id}.api.baseten.co/production/async_run_remote. Internal Chainlet-to-Chainlet calls will still run synchronously.
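For illustration, invoking a Chain asynchronously might look like the sketch below. The request body here is assumed to mirror /async_predict (chain_id and the input are placeholders); check the Chains documentation for the exact schema.

```python
import os
import requests

chain_id = ""  # Replace this with your chain ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL

baseten_api_key = os.environ["BASETEN_API_KEY"]

# Queue an async run of the production Chain deployment.
# NOTE: the body below assumes the same shape as /async_predict;
# consult the Chains docs for the exact request schema.
resp = requests.post(
    f"https://chain-{chain_id}.api.baseten.co/production/async_run_remote",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
    },
)
print(resp.json())
```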
Configure your webhook endpoint to handle POST requests with async predict results. We require that webhook endpoints use HTTPS.
We recommend running a sample request against your webhook endpoint to ensure that it can consume async predict results properly. Try running this webhook test.
For local development, we recommend using this Repl as a starting point. This code validates the webhook request and logs the payload.
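As a minimal sketch of such an endpoint (not the Repl itself), a Flask handler that accepts async predict results and logs them might look like this; signature validation, covered below, is omitted for brevity:

```python
import logging

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)


@app.route("/webhook", methods=["POST"])
def handle_async_predict_result():
    # Async predict results arrive as JSON in the POST body
    payload = request.get_json(force=True)
    logging.info("Received async predict result: %s", payload)
    # Return a 2xx response to acknowledge receipt
    return {"status": "ok"}, 200


if __name__ == "__main__":
    app.run(port=8000)
```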
Create an async request by calling a model’s /async_predict endpoint on the production deployment, the development deployment, or a specific deployment by ID:
Production deployment:

```python
import requests
import os

model_id = ""  # Replace this with your model ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the production deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Optional fields for priority, max_time_in_queue_seconds, etc.
    },
)

print(resp.json())
```
Development deployment:

```python
import requests
import os

model_id = ""  # Replace this with your model ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the development deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/development/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Optional fields for priority, max_time_in_queue_seconds, etc.
    },
)

print(resp.json())
```
Specific deployment:

```python
import requests
import os

model_id = ""  # Replace this with your model ID
deployment_id = ""  # Replace this with your deployment ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the given deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/deployment/{deployment_id}/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Optional fields for priority, max_time_in_queue_seconds, etc.
    },
)

print(resp.json())
```
See the async inference API reference for more endpoint details.
Baseten does not store async predict results. Ensure that prediction outputs are either processed by your webhook, or saved to cloud storage in your model code (for example, in your model’s postprocess method).
If a webhook endpoint was provided in the /async_predict request, the async predict results will be sent in a POST request to the webhook endpoint. Errors in executing the async prediction will be included in the errors field of the async predict result.
Async predict result schema:
request_id (string): the ID of the completed async request. This matches the request_id field of the /async_predict response.
model_id (string): the ID of the model that executed the request.
deployment_id (string): the ID of the deployment that executed the request.
type (string): the type of the async predict result. This will always be "async_request_completed", even in error cases.
time (datetime): the time in UTC at which the request was sent to the webhook.
data (dict or string): the prediction output.
errors (list): any errors that occurred in processing the async request.
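For reference, the schema above can be modeled with pydantic roughly as follows. This is a sketch based only on the field descriptions above, and it matches the async_predict_result.model_dump_json() call used in the validation example later in this guide:

```python
from datetime import datetime
from typing import Union

from pydantic import BaseModel


class AsyncPredictResult(BaseModel):
    request_id: str         # Matches request_id from the /async_predict response
    model_id: str           # ID of the model that executed the request
    deployment_id: str      # ID of the deployment that executed the request
    type: str               # Always "async_request_completed", even in error cases
    time: datetime          # UTC time the request was sent to the webhook
    data: Union[dict, str]  # The prediction output
    errors: list            # Any errors hit while processing the async request
```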
Because async predict results are sent to a webhook endpoint that anyone on the internet can reach, you’ll want to verify that results sent to your webhook are actually coming from Baseten.
We recommend leveraging webhook signatures to secure webhook payloads and ensure they are from Baseten.
This is a two-step process:
Create a webhook secret.
Validate a webhook signature sent as a header along with the webhook request payload.
If a webhook secret exists, Baseten will include a webhook signature in the "X-BASETEN-SIGNATURE" header of the webhook request so you can verify that it is coming from Baseten.
A Baseten signature header looks like:
"X-BASETEN-SIGNATURE": "v1=signature"
Where signature is an HMAC generated using a SHA-256 hash function calculated over the whole async predict result and signed using a webhook secret.
If multiple webhook secrets are active, a signature will be generated using each webhook secret and all of them will be included in the header. For example, if the newer webhook secret was used to create newsignature and the older (soon-to-expire) webhook secret was used to create oldsignature, the header would look like:
"X-BASETEN-SIGNATURE": "v1=newsignature,v1=oldsignature"
To validate a Baseten signature, we recommend the following. A full Baseten signature validation example can be found in this Repl.
1. Compare timestamps
Compare the async predict result timestamp with the current time and decide if it was received within an acceptable tolerance window.
```python
from datetime import datetime, timezone
import logging

TIMESTAMP_TOLERANCE_SECONDS = 300

# Check the timestamp in the async predict result against the current time
# to ensure it's within our tolerance
if (datetime.now(timezone.utc) - async_predict_result.time).total_seconds() > TIMESTAMP_TOLERANCE_SECONDS:
    logging.error(
        f"Async predict result was received after {TIMESTAMP_TOLERANCE_SECONDS} seconds "
        "and is considered stale; Baseten signature was not validated."
    )
```
2. Recompute the Baseten signature
Recreate the Baseten signature using webhook secret(s) and the async predict result.
```python
import hashlib
import hmac

WEBHOOK_SECRETS = []  # Add your webhook secrets here

async_predict_result_json = async_predict_result.model_dump_json()

# We recompute the expected Baseten signature with each webhook secret
for webhook_secret in WEBHOOK_SECRETS:
    for actual_signature in baseten_signature.replace("v1=", "").split(","):
        expected_signature = hmac.digest(
            webhook_secret.encode("utf-8"),
            async_predict_result_json.encode("utf-8"),
            hashlib.sha256,
        ).hex()
```
3. Compare signatures
Compare the expected Baseten signature with the actual signature from the header using compare_digest, which returns a boolean indicating whether the signatures are indeed the same.
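Continuing the sketch from the previous step, the comparison itself is a constant-time check:

```python
import hmac
import logging

# compare_digest runs in constant time, which avoids leaking
# information about the signature through timing differences.
if hmac.compare_digest(expected_signature, actual_signature):
    logging.info("Signature validated: payload is from Baseten.")
else:
    logging.warning("Signature mismatch: rejecting webhook payload.")
```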
We recommend periodically rotating webhook secrets.
If a webhook secret is ever exposed, you can rotate or remove it.
Rotating a secret in the UI sets the existing webhook secret to expire in 24 hours and generates a new webhook secret. During this period, Baseten will include multiple signatures in the signature header.
Removing a webhook secret can cause your signature validation to fail. After deleting one, recreate a webhook secret and make sure your signature validation code is updated with the new webhook secret.