Async requests are a “fire and forget” way of executing model inference requests. Instead of waiting for a response from the model, an async request is queued and a request identifier is returned immediately. Optionally, async request results are sent via a POST request to a user-defined webhook upon completion.

Use async requests for:

  • Long-running inference tasks that may otherwise hit request timeouts.
  • Batched inference jobs.
  • Prioritizing certain inference requests.

Async fast facts:

  • Async requests can be made to any model—no model code changes necessary.
  • Async requests can remain queued for up to 72 hours and run for up to 1 hour.
  • Async requests are not compatible with streaming model output.
  • Async request inputs and model outputs are not stored after an async request has been completed. Instead, model outputs will be sent to your webhook via a POST request.

Quick start

There are two ways to use async inference:

  1. Provide a webhook endpoint where model outputs will be sent via a POST request. If providing a webhook, you can use async inference on any model, without making any changes to your model code.
  2. Inside your Truss’ model.py, save prediction results to cloud storage. If a webhook endpoint is provided, your model outputs will also be sent to your webhook.

Note that Baseten does not store model outputs. If you do not wish to use a webhook, your model.py must write model outputs to a cloud storage bucket or database as part of its implementation.

1

Set up a webhook endpoint

Set up a webhook endpoint for handling completed async requests. Since Baseten doesn’t store model outputs, model outputs from async requests will be sent to your webhook endpoint.

Before creating your first async request, try running a sample request against your webhook endpoint to ensure that it can consume async predict results properly. Check out this example webhook test.

We recommend using this Repl as a starting point.
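As a rough illustration of such a sample request, the sketch below posts a payload shaped like the async predict result documented in this guide to a webhook URL. The helper names `build_sample_result` and `send_sample_result` are hypothetical, the field values are placeholders, and the sketch uses only the Python standard library. It does not add any signature headers, so an endpoint that validates signatures may reject it.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_sample_result(request_id: str = "sample-request-id") -> dict:
    """Build a placeholder payload shaped like an async predict result."""
    return {
        "request_id": request_id,
        "model_id": "my_model_id",
        "deployment_id": "my_deployment_id",
        "type": "async_request_completed",
        # UTC timestamp in the same "...Z" style as the documented example
        "time": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "data": {"my_model_output": "hello world!"},
        "errors": [],
    }


def send_sample_result(webhook_endpoint: str) -> int:
    """POST the sample payload to the webhook and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    body = json.dumps(build_sample_result()).encode("utf-8")
    request = urllib.request.Request(
        webhook_endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_sample_result` with your webhook URL should return a 2xx status if the endpoint accepts the payload.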

2

Schedule an async predict request

Call /async_predict on your model. The body of an /async_predict request includes the model input in the model_input field, plus the webhook endpoint (from the previous step) in the webhook_endpoint field.

Python
import requests
import os

model_id = ""  # Replace this with your model ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the production deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint
        # Optional fields for priority, max_time_in_queue_seconds, etc
    },
)

print(resp.json())

Save the request_id from the /async_predict response to check its status or cancel it.

201
{
  "request_id": "9876543210abcdef1234567890fedcba"
}

See the async inference API reference for more endpoint details.

3

Check async predict results

Using the request_id saved from the previous step, check the status of your async predict request:

Python
import requests
import os

model_id = ""  # Replace this with your model ID
request_id = ""  # Replace this with the request_id returned by /async_predict
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"}
)

print(resp.json())

Once your model has finished executing the request, the async predict result will be sent to your webhook in a POST request.

{
  "request_id": "9876543210abcdef1234567890fedcba",
  "model_id": "my_model_id",
  "deployment_id": "my_deployment_id",
  "type": "async_request_completed",
  "time": "2024-04-30T01:01:08.883423Z",
  "data": {
    "my_model_output": "hello world!"
  },
  "errors": []
}

4

Secure your webhook

We strongly recommend securing the requests sent to your webhooks to validate that they are from Baseten.

For instructions, see our guide to securing async requests.
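The linked guide is the authoritative description of how requests are signed. Purely as an illustration of the general technique, the sketch below verifies an HMAC-SHA256 signature over the raw request body using a shared secret. The `v1=<hex digest>` header format and the function names are assumptions made for this example, not confirmed details of Baseten's scheme.

```python
import hashlib
import hmac


def compute_signature(secret: str, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Check a signature header against the expected digest.

    Assumed (hypothetical) header format: comma-separated "v1=<hex digest>"
    entries, any one of which may match.
    """
    expected = compute_signature(secret, body).encode("utf-8")
    for part in header_value.split(","):
        version, _, candidate = part.strip().partition("=")
        # compare_digest avoids leaking information through timing differences
        if version == "v1" and hmac.compare_digest(
            expected, candidate.encode("utf-8")
        ):
            return True
    return False
```

Always verify against the raw bytes of the request body rather than a re-serialized JSON object, since re-serialization can change the bytes and therefore the digest.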

Chains: this guide is written for Truss models, but Chains also support async inference. A Chain entrypoint can be invoked via its async_run_remote endpoint, e.g. https://chain-{chain_id}.api.baseten.co/production/async_run_remote. Internal Chainlet-to-Chainlet calls still run synchronously.

User guide

Configuring the webhook endpoint

Configure your webhook endpoint to handle POST requests with async predict results. We require that webhook endpoints use HTTPS.

We recommend running a sample request against your webhook endpoint to ensure that it can consume async predict results properly. Try running this webhook test.

For local development, we recommend using this Repl as a starting point. This code validates the webhook request and logs the payload.
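For readers who want to see the overall shape of such a receiver, here is a minimal sketch using only the Python standard library. It parses the POST body as JSON and logs a one-line summary. The names `summarize_result`, `WebhookHandler`, and `run_server` are hypothetical, and signature validation is omitted. The server speaks plain HTTP, so it would need to sit behind something that terminates HTTPS before it could be used as a webhook endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_result(payload: dict) -> str:
    """Return a one-line summary of an async predict result payload."""
    request_id = payload.get("request_id", "<unknown>")
    errors = payload.get("errors") or []
    if errors:
        return f"request {request_id} failed: {errors}"
    return f"request {request_id} completed: {payload.get('data')}"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes declared by the sender.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            payload = None
        if not isinstance(payload, dict):
            # Reject bodies that are not a JSON object.
            self.send_response(400)
            self.end_headers()
            return
        print(summarize_result(payload))
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()


def run_server(port: int = 8000) -> None:
    """Serve the handler until interrupted. The port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Call `run_server()` to start listening locally.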

Making async requests

Python
import requests
import os

model_id = ""  # Replace this with your model ID
webhook_endpoint = ""  # Replace this with your webhook endpoint URL
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the production deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint
        # Optional fields for priority, max_time_in_queue_seconds, etc
    },
)

print(resp.json())

Create an async request by calling a model’s /async_predict endpoint. See the async inference API reference for more endpoint details.
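The request body can be assembled with a small helper. The sketch below is one possible way to do this, using the field names that appear in this guide (model_input, webhook_endpoint, priority, max_time_in_queue_seconds). The helper name `build_async_body` is hypothetical, and the client-side checks are assumptions based on the 72-hour queue limit and the HTTPS requirement described in this guide. The API reference defines the values that are actually accepted.

```python
from typing import Any, Optional

# 72 hours, the maximum time a request can remain queued per this guide
MAX_QUEUE_SECONDS = 72 * 60 * 60


def build_async_body(
    model_input: Any,
    webhook_endpoint: Optional[str] = None,
    priority: Optional[int] = None,
    max_time_in_queue_seconds: Optional[int] = None,
) -> dict:
    """Assemble the JSON body for an /async_predict request."""
    body = {"model_input": model_input}
    if webhook_endpoint is not None:
        # Webhook endpoints are required to use HTTPS.
        if not webhook_endpoint.startswith("https://"):
            raise ValueError("webhook_endpoint must use HTTPS")
        body["webhook_endpoint"] = webhook_endpoint
    if priority is not None:
        body["priority"] = priority
    if max_time_in_queue_seconds is not None:
        # Client-side sanity check only; the server is the source of truth.
        if not 0 < max_time_in_queue_seconds <= MAX_QUEUE_SECONDS:
            raise ValueError(
                f"max_time_in_queue_seconds must be in (0, {MAX_QUEUE_SECONDS}]"
            )
        body["max_time_in_queue_seconds"] = max_time_in_queue_seconds
    return body
```

The returned dictionary can be passed as the `json` argument of the `requests.post` call shown above.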

Getting and canceling async requests

You can get the status of an async request for up to 1 hour after the request has been completed.

Python
import requests
import os

model_id = ""  # Replace this with your model ID
request_id = ""  # Replace this with the request_id returned by /async_predict
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"}
)

print(resp.json())

Manage async requests using the get async request API endpoint and the cancel async request API endpoint.
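When no webhook is in use, a client may want to poll the status endpoint until the request leaves the queue. The sketch below shows one way to structure that loop. The function `wait_for_completion` is hypothetical, and so are the `status` field name and the `IN_PROGRESS` state (only QUEUED is named in this guide). Check the get async request API reference for the actual response shape.

```python
import time
from typing import Callable, Iterable


def wait_for_completion(
    fetch_status: Callable[[], dict],
    pending_states: Iterable[str] = ("QUEUED", "IN_PROGRESS"),
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the request is no longer pending.

    fetch_status is any zero-argument callable returning the parsed status
    response. Raises TimeoutError if the request is still pending after
    max_attempts polls.
    """
    pending = set(pending_states)
    for attempt in range(max_attempts):
        result = fetch_status()
        # Assumed field name: "status"
        if result.get("status") not in pending:
            return result
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"request still pending after {max_attempts} polls")
```

Here `fetch_status` could wrap the GET request shown above, for example a function that calls `requests.get(...)` with the same URL and headers and returns `resp.json()`.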

Processing async predict results

Baseten does not store async predict results. Ensure that prediction outputs are either processed by your webhook, or saved to cloud storage in your model code (for example, in your model’s postprocess method).
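To make the second option concrete, the following sketch shows a model class whose postprocess method persists each output before returning it. Everything here is illustrative: `save_output` is a hypothetical helper that writes JSON files to a local temporary directory as a stand-in for a cloud storage bucket or database, and the "model" simply upper-cases the prompt. In a real model.py, the body of `save_output` would be replaced with a call to your storage client.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Hypothetical stand-in for a cloud storage bucket
OUTPUT_DIR = Path(tempfile.gettempdir()) / "async_outputs"


def save_output(output: dict, output_dir: Path = OUTPUT_DIR) -> Path:
    """Write one model output as a JSON file and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(output))
    return path


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Placeholder: a real model would load weights here.
        self._model = lambda prompt: prompt.upper()

    def predict(self, model_input: dict) -> dict:
        return {"my_model_output": self._model(model_input["prompt"])}

    def postprocess(self, model_output: dict) -> dict:
        # Persist the output, since async results are not stored for you.
        save_output(model_output)
        return model_output
```

Returning the output from postprocess unchanged means a webhook, if one was provided, still receives the same result.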

If a webhook endpoint was provided in the /async_predict request, the async predict results will be sent in a POST request to the webhook endpoint. Errors in executing the async prediction will be included in the errors field of the async predict result.

Async predict result schema:

  • request_id (string): the ID of the completed async request. This matches the request_id field of the /async_predict response.
  • model_id (string): the ID of the model that executed the request
  • deployment_id (string): the ID of the deployment that executed the request
  • type (string): the type of the async predict result. This will always be "async_request_completed", even in error cases.
  • time (datetime): the time in UTC at which the request was sent to the webhook
  • data (dict or string): the prediction output
  • errors (list): any errors that occurred in processing the async request

Example async predict result:

{
  "request_id": "9876543210abcdef1234567890fedcba",
  "model_id": "my_model_id",
  "deployment_id": "my_deployment_id",
  "type": "async_request_completed",
  "time": "2024-04-30T01:01:08.883423Z",
  "data": {
    "my_model_output": "hello world!"
  },
  "errors": []
}
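Webhook code often benefits from turning this payload into a typed object. The sketch below parses the fields listed in the schema above into a dataclass. The names `AsyncPredictResult` and `parse_result` are hypothetical, and treating an empty errors list as success is an assumption based on the schema description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class AsyncPredictResult:
    request_id: str
    model_id: str
    deployment_id: str
    type: str
    time: datetime
    data: Any
    errors: list

    @property
    def succeeded(self) -> bool:
        # "type" is the same in error cases, so check the errors list instead.
        return not self.errors


def parse_result(payload: dict) -> AsyncPredictResult:
    """Convert a webhook payload into an AsyncPredictResult."""
    # Before Python 3.11, fromisoformat does not accept a trailing "Z",
    # so rewrite it as an explicit UTC offset.
    timestamp = datetime.fromisoformat(payload["time"].replace("Z", "+00:00"))
    return AsyncPredictResult(
        request_id=payload["request_id"],
        model_id=payload["model_id"],
        deployment_id=payload["deployment_id"],
        type=payload["type"],
        time=timestamp,
        data=payload.get("data"),
        errors=payload.get("errors") or [],
    )
```

Checking `result.succeeded` rather than the type field reflects the note above that type is always "async_request_completed", even in error cases.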

Observability

Metrics for async request execution are available on the Metrics tab of your model dashboard.

  • Async requests are included in inference latency and volume metrics.
  • A time in async queue chart displays the time an async predict request spent in the QUEUED state before getting processed by the model.
  • An async queue size chart displays the current number of queued async predict requests.

The time in async queue chart.