Configuring concurrency is key to optimizing model performance: it balances throughput against latency.
In Baseten and Truss, concurrency is managed at two levels:
- Concurrency Target – Limits the number of requests sent to a single replica.
- Predict Concurrency – Limits how many requests the `predict` function handles inside the model container.
1. Concurrency Target
- Set in the Baseten UI – Defines how many requests a single replica can process at once.
- Triggers autoscaling – If all replicas hit the concurrency target, additional replicas spin up.
Example:
- Concurrency Target = 2, Single Replica
- 5 requests arrive → 2 are processed immediately, 3 are queued.
- If the max replica count hasn't been reached, autoscaling spins up a new replica (see the sketch below).
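To make the queueing concrete, here is a toy asyncio sketch. It is not Baseten's actual scheduler; a semaphore simply stands in for a single replica with a concurrency target of 2 receiving 5 simultaneous requests:

```python
import asyncio

# Toy illustration only: a semaphore models a replica that admits
# at most CONCURRENCY_TARGET requests at once.
CONCURRENCY_TARGET = 2  # mirrors the UI setting in the example above

async def handle_request(request_id: int, slot: asyncio.Semaphore) -> None:
    async with slot:  # requests beyond the target wait here in the queue
        print(f"request {request_id}: processing")
        await asyncio.sleep(1)  # stand-in for model inference
        print(f"request {request_id}: done")

async def main() -> None:
    slot = asyncio.Semaphore(CONCURRENCY_TARGET)
    # 5 requests arrive together: 2 run immediately, 3 are queued.
    await asyncio.gather(*(handle_request(i, slot) for i in range(5)))

asyncio.run(main())
```

Requests 0 and 1 start immediately; requests 2 through 4 wait for a free slot. On a real deployment, that sustained queueing pressure is what triggers autoscaling.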
2. Predict Concurrency
- Set in `config.yaml` – Controls how many requests the `predict` function can process simultaneously.
- Protects GPU resources – Prevents multiple requests from overloading the GPU.
Configuring Predict Concurrency
```yaml
model_name: "My model with concurrency limits"
runtime:
  predict_concurrency: 2 # Default is 1
```
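After editing `config.yaml`, the new limit takes effect on the next deploy (with the Truss CLI, typically via `truss push`).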
How It Works Inside a Model Pod
- Requests arrive → All begin preprocessing (e.g., downloading images from S3).
- Predict runs on GPU → Limited by `predict_concurrency`.
- Postprocessing begins → Can run while other requests are still in inference (see the model sketch below).
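Here is a minimal sketch of how these stages map onto a Truss model class. The placeholder "model" and the S3 example in the comments are illustrative, not a real implementation; the key point is that only `predict` sits behind the `predict_concurrency` gate:

```python
# model/model.py — illustrative sketch; the loaded "model" is a placeholder.

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per replica at startup; load real weights here.
        self._model = lambda x: {"echo": x}

    def preprocess(self, model_input):
        # I/O-bound work (e.g., downloading images from S3) happens here and
        # is NOT limited by predict_concurrency, so many requests can
        # preprocess in parallel while others occupy the GPU.
        return model_input

    def predict(self, model_input):
        # GPU-bound inference. With predict_concurrency: 2, at most two
        # requests execute this method at the same time.
        return self._model(model_input)

    def postprocess(self, model_output):
        # Also outside the predict_concurrency gate: responses can be
        # serialized or uploaded while other requests are still in predict.
        return model_output
```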
When to Use Predict Concurrency
- ✅ Protect GPU resources – Prevent multiple requests from degrading performance.
- ✅ Allow parallel preprocessing/postprocessing – I/O tasks can continue even when inference is blocked.
Ensure the concurrency target is set high enough to send enough concurrent requests to the container; predict concurrency can't be fully utilized if the concurrency target throttles traffic first.