Request concurrency
A guide to setting concurrency for your model
Configuring concurrency optimizes model performance by balancing throughput and latency.
In Baseten and Truss, concurrency is managed at two levels:
- Concurrency Target → Limits the number of requests sent to a single replica.
- Predict Concurrency → Limits how many requests the predict function handles inside the model container.
1. Concurrency Target
- Set in the Baseten UI → Defines how many requests a single replica can process at once.
- Triggers autoscaling → If all replicas hit the concurrency target, additional replicas spin up.
Example:
- Concurrency Target = 2, single replica
- 5 requests arrive → 2 are processed immediately, 3 are queued (see the toy sketch after this list).
- If max replicas aren't reached, autoscaling spins up a new replica.
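As a toy illustration of this queueing behavior (a sketch, not Baseten's actual scheduler), a semaphore models a single replica whose concurrency target is 2: two of five simultaneous requests are processed immediately while the other three wait for a slot.

```python
# Toy simulation of one replica with concurrency target = 2.
# Illustrative sketch only; this is not Baseten's scheduler.
import asyncio

CONCURRENCY_TARGET = 2


async def handle_request(request_id: int, slots: asyncio.Semaphore) -> None:
    async with slots:  # Only 2 requests hold a slot at a time; the rest queue.
        print(f"request {request_id}: processing")
        await asyncio.sleep(1)  # Stand-in for model inference time.
        print(f"request {request_id}: done")


async def main() -> None:
    slots = asyncio.Semaphore(CONCURRENCY_TARGET)
    # 5 requests arrive at once: 2 run immediately, 3 wait in the queue.
    await asyncio.gather(*(handle_request(i, slots) for i in range(1, 6)))


asyncio.run(main())
```

In production, sustained queueing like this is what triggers Baseten to spin up additional replicas, up to your configured maximum.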
2. Predict Concurrency
- Set in config.yaml → Controls how many requests can be processed by predict simultaneously.
- Protects GPU resources → Prevents multiple requests from overloading the GPU.
Configuring Predict Concurrency
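A minimal config.yaml sketch: in Truss, predict_concurrency lives under the runtime key, and the value of 2 here is just an illustrative choice.

```yaml
# config.yaml: allow up to 2 requests inside predict() at once.
# The default of 1 fully serializes access to the GPU.
runtime:
  predict_concurrency: 2
```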
How It Works Inside a Model Pod
- Requests arrive → All begin preprocessing (e.g., downloading images from S3).
- Predict runs on GPU → Limited by predict_concurrency (see the sketch after this list).
- Postprocessing begins → Can run while other requests are still in inference.
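To make that concrete, here is a sketch of a Truss-style model class. The hook names (load, preprocess, predict, postprocess) are Truss's standard model interface, but the method bodies, the image_url input field, and the dummy model are illustrative assumptions.

```python
# Sketch of a Truss model showing which stage predict_concurrency gates.
# Only predict() is limited; preprocess/postprocess can run for any
# number of requests in parallel. Bodies are illustrative stand-ins.
import requests


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per replica, e.g., loading weights onto the GPU.
        self._model = lambda image_bytes: len(image_bytes)  # Dummy stand-in.

    def preprocess(self, model_input: dict) -> dict:
        # I/O-bound: downloads proceed concurrently for all queued requests,
        # even while other requests are waiting to enter predict().
        image_bytes = requests.get(model_input["image_url"]).content
        return {"image": image_bytes}

    def predict(self, model_input: dict) -> dict:
        # GPU-bound: at most predict_concurrency requests are in here at
        # once, protecting GPU memory and compute.
        return {"prediction": self._model(model_input["image"])}

    def postprocess(self, model_output: dict) -> dict:
        # I/O-bound: can run while other requests are still in inference.
        return {"result": model_output["prediction"]}
```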
When to Use Predict Concurrency
- ✅ Protect GPU resources → Prevent multiple requests from degrading performance.
- ✅ Allow parallel preprocessing/postprocessing → I/O tasks can continue even when inference is blocked.
Ensure Concurrency Target is set high enough to send enough requests to the container; otherwise requests queue at the replica level and predict_concurrency never becomes the limiting factor.