Configuring concurrency is key to optimizing model performance: it balances throughput against latency.
In Baseten and Truss, concurrency is managed at two levels:
- Concurrency Target – Limits the number of requests sent to a single replica.
- Predict Concurrency – Limits how many requests the `predict` function handles inside the model container.
1. Concurrency Target
- Set in the Baseten UI – Defines how many requests a single replica can process at once.
- Triggers autoscaling – If all replicas hit the concurrency target, additional replicas spin up.
Example:
- Concurrency Target = 2, Single Replica
- 5 requests arrive → 2 are processed immediately, 3 are queued.
- If the max replica count hasn't been reached, autoscaling spins up a new replica (see the sketch below).
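To make the queueing concrete, here is a toy asyncio sketch. It is not Baseten's actual scheduler; a semaphore simply stands in for a single replica with a concurrency target of 2 receiving 5 simultaneous requests:

```python
import asyncio

# Toy illustration only: a semaphore models a replica that admits
# at most CONCURRENCY_TARGET requests at once.
CONCURRENCY_TARGET = 2  # mirrors the UI setting in the example above

async def handle_request(request_id: int, slot: asyncio.Semaphore) -> None:
    async with slot:  # requests beyond the target wait here in the queue
        print(f"request {request_id}: processing")
        await asyncio.sleep(1)  # stand-in for model inference
        print(f"request {request_id}: done")

async def main() -> None:
    slot = asyncio.Semaphore(CONCURRENCY_TARGET)
    # 5 requests arrive together: 2 run immediately, 3 are queued.
    await asyncio.gather(*(handle_request(i, slot) for i in range(5)))

asyncio.run(main())
```

Requests 0 and 1 start immediately; requests 2 through 4 wait for a free slot. On a real deployment, that sustained queueing pressure is what triggers autoscaling.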
2. Predict Concurrency
- Set in `config.yaml` – Controls how many requests the `predict` function can process simultaneously.
- Protects GPU resources – Prevents multiple requests from overloading the GPU.
Configuring Predict Concurrency
```yaml
model_name: "My model with concurrency limits"
runtime:
  predict_concurrency: 2 # Default is 1
```
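After editing `config.yaml`, the new limit takes effect on the next deploy (with the Truss CLI, typically via `truss push`).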
How It Works Inside a Model Pod
- Requests arrive → All begin preprocessing (e.g., downloading images from S3).
- Predict runs on GPU → Limited by `predict_concurrency`.
- Postprocessing begins → Can run while other requests are still in inference (see the model sketch below).
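Here is a minimal sketch of how these stages map onto a Truss model class. The placeholder "model" and the S3 example in the comments are illustrative, not a real implementation; the key point is that only `predict` sits behind the `predict_concurrency` gate:

```python
# model/model.py — illustrative sketch; the loaded "model" is a placeholder.

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per replica at startup; load real weights here.
        self._model = lambda x: {"echo": x}

    def preprocess(self, model_input):
        # I/O-bound work (e.g., downloading images from S3) happens here and
        # is NOT limited by predict_concurrency, so many requests can
        # preprocess in parallel while others occupy the GPU.
        return model_input

    def predict(self, model_input):
        # GPU-bound inference. With predict_concurrency: 2, at most two
        # requests execute this method at the same time.
        return self._model(model_input)

    def postprocess(self, model_output):
        # Also outside the predict_concurrency gate: responses can be
        # serialized or uploaded while other requests are still in predict.
        return model_output
```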
When to Use Predict Concurrency
- ✅ Protect GPU resources – Prevent multiple requests from degrading performance.
- ✅ Allow parallel preprocessing/postprocessing – I/O tasks can continue even when inference is blocked.
Ensure the concurrency target is set high enough to send enough concurrent requests to the container; predict concurrency can't be fully utilized if the concurrency target throttles traffic first.