Async inference is a fire-and-forget pattern for model requests. Instead of waiting for a response, you receive a request ID immediately while inference runs in the background. When inference completes, results are delivered to your webhook endpoint.
Async requests work with any deployed model; no code changes are required. Requests can queue for up to 72 hours and run for up to 1 hour. Async inference is not compatible with streaming output.
Use async inference for:
  • Long-running tasks that would otherwise hit request timeouts.
  • Batch processing where you don’t need immediate responses.
  • Priority queuing to serve VIP customers faster.
Baseten does not store model outputs. If webhook delivery fails after all retries, your data is lost. See Webhook delivery for mitigation strategies.

Quick start

1. Set up a webhook endpoint

Create an HTTPS endpoint to receive results. Use this Repl as a starting point, or deploy to any service that can receive POST requests.
2. Make an async request

Call your model’s /async_predict endpoint with your webhook URL:
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_ENDPOINT"
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call the async_predict endpoint of the production deployment
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # "priority": 0,
        # "max_time_in_queue_seconds": 600,
    },
)

print(resp.json())
You’ll receive a request_id immediately.
3. Receive results

When inference completes, Baseten sends a POST request to your webhook with the model output. See Webhook payload for the response format.
Chains support async inference through async_run_remote. Inference requests to the entrypoint are queued, but internal Chainlet-to-Chainlet calls run synchronously.

How async works

Async inference decouples request submission from processing, letting you queue work without waiting for results.

Request lifecycle

When you submit an async request:
  1. You call /async_predict and immediately receive a request_id.
  2. Your request enters a queue managed by the Async Request Service.
  3. A background worker picks up your request and calls your model’s predict endpoint.
  4. Your model runs inference and returns a response.
  5. Baseten sends the response to your webhook URL using POST.
The max_time_in_queue_seconds parameter controls how long a request can wait in the queue before expiring. It defaults to 10 minutes (600 seconds) and can be set as high as 72 hours.
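For example, to let a request wait in the queue for up to one hour, pass the parameter in the request body (a sketch mirroring the quick-start call; the one-hour value is illustrative):
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_ENDPOINT"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        # Allow up to 1 hour in the queue instead of the 10-minute default
        "max_time_in_queue_seconds": 3600,
    },
)

print(resp.json())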

Autoscaling behavior

The async queue is decoupled from model scaling. Requests queue successfully even when your model has zero replicas. When your model is scaled to zero:
  1. Your request enters the queue while the model has no running replicas.
  2. The queue processor attempts to call your model, triggering the autoscaler.
  3. Your request waits while the model cold-starts.
  4. Once the model is ready, inference runs and completes.
  5. Baseten delivers the result to your webhook.
If the model doesn’t become ready within max_time_in_queue_seconds, the request expires with status EXPIRED. Set this parameter to account for your model’s startup time. For models with long cold starts, consider keeping minimum replicas running using autoscaling settings.

Async priority

Async requests are subject to two levels of priority: how they compete with sync requests for model capacity, and how they’re ordered relative to other async requests in the queue.

Sync vs async concurrency

Sync and async requests share your model’s concurrency pool, controlled by predict_concurrency in your model configuration:
config.yaml
runtime:
  predict_concurrency: 10
The predict_concurrency setting defines how many requests your model can process simultaneously per replica. When both sync and async requests are in flight, sync requests take priority. The queue processor monitors your model’s capacity and backs off when it receives 429 responses, ensuring sync traffic isn’t starved. For example, if your model has predict_concurrency=10 and 8 sync requests are running, only 2 slots remain for async requests. The remaining async requests stay queued until capacity frees up.

Async queue priority

Within the async queue itself, you can control processing order using the priority parameter. This is useful for serving specific requests faster or ensuring critical batch jobs run before lower-priority work.
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_URL"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "webhook_endpoint": webhook_endpoint,
        "model_input": {"prompt": "hello world!"},
        "priority": 0,
    },
)

print(resp.json())
The priority parameter accepts values 0, 1, or 2. Lower values indicate higher priority: a request with priority: 0 is processed before requests with priority: 1 or priority: 2. If you don’t specify a priority, requests default to priority 1. Use priority 0 sparingly for truly urgent requests. If all requests are marked priority 0, the prioritization has no effect.

Webhooks

Baseten delivers async results to your webhook endpoint when inference completes.

Request format

When inference completes, Baseten sends a POST request to your webhook with these headers and body:
POST /your-webhook-path HTTP/2.0
Content-Type: application/json
X-BASETEN-REQUEST-ID: 9876543210abcdef1234567890fedcba
X-BASETEN-SIGNATURE: v1=abc123...
The X-BASETEN-REQUEST-ID header contains the request ID for correlating webhooks with your original requests. The X-BASETEN-SIGNATURE header is only included if a webhook secret is configured.
Webhook endpoints must use HTTPS (except localhost for development). Baseten supports HTTP/2 and HTTP/1.1 connections.
{
  "request_id": "9876543210abcdef1234567890fedcba",
  "model_id": "abc123",
  "deployment_id": "def456",
  "type": "async_request_completed",
  "time": "2024-04-30T01:01:08.883423Z",
  "data": { "output": "model response here" },
  "errors": []
}
The body contains the request_id matching your original /async_predict response, along with model_id and deployment_id identifying which deployment ran the request. The data field contains your model output, or null if an error occurred. The errors array is empty on success, or contains error objects on failure.
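As an illustration, here is a minimal receiver sketch using Flask (the framework choice and route path are assumptions; any HTTPS endpoint that accepts POST requests works). It parses the payload fields described above:
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    payload = request.get_json()
    request_id = payload["request_id"]  # matches the ID returned by /async_predict
    if payload["errors"]:
        print(f"Request {request_id} failed: {payload['errors']}")
    else:
        print(f"Request {request_id} output: {payload['data']}")
    return "", 200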

Webhook delivery

If all delivery attempts fail, your model output is permanently lost.
Baseten delivers webhooks on a best-effort basis with automatic retries:
Setting            Value
Total attempts     3 (1 initial + 2 retries)
Backoff            1 second, then 4 seconds
Timeout            10 seconds per attempt
Retryable codes    500, 502, 503, 504
To prevent data loss:
  1. Save outputs in your model. Use the postprocess() function to write to cloud storage:
import json
import boto3

class Model:
    # ...
    def postprocess(self, model_output):
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket="my-bucket",
            Key=f"outputs/{self.context.get('request_id')}.json",
            Body=json.dumps(model_output)
        )
        return model_output
The postprocess method runs after inference completes and writes the output to your storage location before returning it. Use self.context.get('request_id') to access the async request ID for correlating outputs with requests.
  2. Use a reliable endpoint. Deploy your webhook to a highly available service like a cloud function or message queue.

Secure webhooks

Create a webhook secret in the Secrets tab to verify requests are from Baseten. When configured, Baseten includes an X-BASETEN-SIGNATURE header:
X-BASETEN-SIGNATURE: v1=abc123...
To validate, compute an HMAC-SHA256 of the request body using your secret and compare:
import hashlib
import hmac

def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    actual = signature.replace("v1=", "").split(",")[0]
    return hmac.compare_digest(expected, actual)
The function computes an HMAC-SHA256 hash of the raw request body using your webhook secret. It extracts the signature value after v1= and uses compare_digest for timing-safe comparison to prevent timing attacks. Rotate secrets periodically. During rotation, both old and new secrets remain valid for 24 hours.
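For example, wired into a Flask handler like the receiver sketch above (the secret’s environment variable name is an assumption; store the secret however you prefer):
import os
from flask import Flask, request

app = Flask(__name__)
webhook_secret = os.environ["BASETEN_WEBHOOK_SECRET"]  # assumed variable name

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    signature = request.headers.get("X-BASETEN-SIGNATURE", "")
    if not verify_signature(request.get_data(), signature, webhook_secret):
        return "invalid signature", 401
    payload = request.get_json()
    # ... process the payload as usual ...
    return "", 200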

Manage requests

You can check the status of async requests or cancel them while they’re queued.

Check request status

To check the status of an async request, call the status endpoint with your request ID:
import requests
import os

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.get(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"}
)

print(resp.json())
Status is available for 1 hour after completion. See the status API reference for details.
Status            Description
QUEUED            Waiting in queue
IN_PROGRESS       Currently processing
SUCCEEDED         Completed successfully
FAILED            Failed after retries
EXPIRED           Exceeded max_time_in_queue_seconds
CANCELED          Canceled by user
WEBHOOK_FAILED    Inference succeeded but webhook delivery failed
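A simple polling sketch, assuming the response body includes a status field with the values above (prefer webhooks in production to stay under the polling rate limit):
import os
import time
import requests

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

terminal_statuses = {"SUCCEEDED", "FAILED", "EXPIRED", "CANCELED", "WEBHOOK_FAILED"}

while True:
    resp = requests.get(
        f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
    )
    status = resp.json().get("status")
    print(status)
    if status in terminal_statuses:
        break
    time.sleep(5)  # poll well below the 20 requests/second limit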

Cancel a request

Only QUEUED requests can be canceled. To cancel a request, call the cancel endpoint with your request ID:
import requests
import os

model_id = "YOUR_MODEL_ID"
request_id = "YOUR_REQUEST_ID"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.delete(
    f"https://model-{model_id}.api.baseten.co/async_request/{request_id}",
    headers={"Authorization": f"Api-Key {baseten_api_key}"}
)

print(resp.json())
For more information, see the cancel async request API reference.

Error codes

When inference fails, the webhook payload returns an errors array:
{
  "errors": [{ "code": "MODEL_PREDICT_ERROR", "message": "Details here" }]
}
Code                     HTTP       Description                      Retried
MODEL_NOT_READY          400        Model is loading or starting     Yes
MODEL_DOES_NOT_EXIST     404        Model or deployment not found    No
MODEL_INVALID_INPUT      422        Invalid input format             No
MODEL_PREDICT_ERROR      500        Exception in model.predict()     Yes
MODEL_UNAVAILABLE        502/503    Model crashed or scaling         Yes
MODEL_PREDICT_TIMEOUT    504        Inference exceeded timeout       Yes
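For example, a webhook handler might branch on the error code to decide whether resubmitting could help (a sketch; the function name is illustrative and payload is the parsed webhook body shown earlier):
# Codes marked "Yes" above were already retried by Baseten before the webhook fired
retryable_codes = {
    "MODEL_NOT_READY",
    "MODEL_PREDICT_ERROR",
    "MODEL_UNAVAILABLE",
    "MODEL_PREDICT_TIMEOUT",
}

def handle_errors(payload: dict) -> None:
    for error in payload.get("errors", []):
        if error["code"] in retryable_codes:
            # Transient failure that exhausted retries; resubmitting may succeed
            print(f"Transient failure: {error['message']}")
        else:
            # Input or configuration problem; resubmitting the same request won't help
            print(f"Permanent failure: {error['code']}: {error['message']}")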

Inference retries

When inference fails with a retryable error, Baseten automatically retries the request using exponential backoff. Configure this behavior with inference_retry_config:
import requests
import os

model_id = "YOUR_MODEL_ID"
webhook_endpoint = "YOUR_WEBHOOK_URL"
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "model_input": {"prompt": "hello world!"},
        "webhook_endpoint": webhook_endpoint,
        "inference_retry_config": {
            "max_attempts": 3,
            "initial_delay_ms": 1000,
            "max_delay_ms": 5000
        }
    },
)

print(resp.json())
Parameter           Range       Default    Description
max_attempts        1-10        3          Total inference attempts including the original
initial_delay_ms    0-10,000    1000       Delay before the first retry (ms)
max_delay_ms        0-60,000    5000       Maximum delay between retries (ms)
Retries use exponential backoff with a multiplier of 2. With the default configuration, delays progress as: 1s β†’ 2s β†’ 4s β†’ 5s (capped at max_delay_ms). Only requests that fail with retryable error codes (500, 502, 503, 504) are retried. Non-retryable errors like invalid input (422) or model not found (404) fail immediately.
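The schedule can be sketched in a few lines (illustrative only; retry_delays_ms is not part of the API):
def retry_delays_ms(max_attempts=3, initial_delay_ms=1000, max_delay_ms=5000):
    # One delay before each retry; the first attempt happens immediately
    return [
        min(initial_delay_ms * 2**i, max_delay_ms)
        for i in range(max_attempts - 1)
    ]

print(retry_delays_ms())                 # [1000, 2000]
print(retry_delays_ms(max_attempts=6))   # [1000, 2000, 4000, 5000, 5000]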
Inference retries are distinct from webhook delivery retries. Inference retries happen when calling your model fails. Webhook retries happen when delivering results to your endpoint fails.

Rate limits

The async predict, status, and cancel endpoints are rate limited. If you exceed these limits, you receive a 429 status code.
Endpoint                                       Limit
Predict endpoint requests (/async_predict)     12,000 requests/minute (org-level)
Status polling                                 20 requests/second
Cancel request                                 20 requests/second
Use webhooks instead of polling to avoid status endpoint limits. Contact Baseten support to request increases.
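If you do hit a 429 when submitting a large batch, a client-side backoff sketch like this one keeps submission within the limit (the helper name and retry counts are illustrative):
import time
import requests

def submit_with_backoff(url, headers, body, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=body)
        if resp.status_code != 429:
            return resp.json()
        time.sleep(2**attempt)  # back off before retrying after a rate limit
    raise RuntimeError("Still rate limited after retries")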

Observability

Async metrics are available on the Metrics tab of your model dashboard:
  • Inference latency/volume: includes async requests.
  • Time in async queue: time spent in QUEUED state.
  • Async queue size: number of queued requests.

[Screenshot: Async queue metrics in the dashboard]

Resources

For more information and resources, see the following: