- Concurrency target – Limits the number of requests sent to a single replica.
- Predict concurrency – Limits how many requests the predict function handles at once inside the model container.
Concurrency target
- Set in the Baseten UI or API – Defines how many requests a single replica handles before autoscaling triggers.
- Triggers autoscaling – When all replicas hit the concurrency target, additional replicas spin up.
- Example: concurrency target = 2, single replica
- 5 requests arrive → 2 are processed immediately, 3 are queued
- If the maximum replica count hasn’t been reached, autoscaling spins up a new replica (see the sketch below)
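To make the example concrete, the sketch below runs the same arithmetic. This is only the replica count implied by the example, not Baseten’s actual autoscaler logic:

```python
import math


def replicas_needed(in_flight_requests: int, concurrency_target: int) -> int:
    """Replicas needed so no replica holds more than concurrency_target requests."""
    return math.ceil(in_flight_requests / concurrency_target)


# 5 requests with a concurrency target of 2: one replica serves 2 and queues 3,
# so scaling toward ceil(5 / 2) = 3 replicas (capped at max replicas) clears the queue.
print(replicas_needed(5, 2))  # 3
```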
Predict concurrency
predict_concurrency is only available for Python Truss models. It is not configurable for custom servers or TensorRT-LLM Engine Builder deployments.
- Set in config.yaml – Controls how many requests can be processed by predict simultaneously inside the container.
- Protects GPU resources – Prevents multiple requests from overloading the GPU.
Configuring predict concurrency
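A minimal config.yaml excerpt, assuming the standard Truss runtime.predict_concurrency field; every other field in your existing config stays unchanged:

```yaml
# config.yaml (excerpt)
runtime:
  predict_concurrency: 2  # requests processed by predict() at once; defaults to 1
```

Raising this value only pays off when work outside predict (preprocessing, postprocessing, I/O) can overlap with GPU inference, as described in the next section.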
How it works inside a model pod
- Requests arrive → All begin preprocessing (e.g., downloading images from S3).
- Predict runs on GPU → Limited by predict_concurrency (see the sketch after this list).
- Postprocessing begins → Can run while other requests are still in inference.
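A hypothetical model/model.py illustrating the three phases. The image-URL input and the placeholder model are assumptions for illustration, not part of any real deployment:

```python
# model/model.py – sketch of a Truss model showing where each phase runs.
# Only predict() is gated by predict_concurrency; preprocess() and postprocess()
# for other requests can run while a predict() call holds the GPU.
import requests


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Load weights onto the GPU once per replica (placeholder model here).
        self._model = lambda image_bytes: {"label": "cat"}

    def preprocess(self, model_input: dict) -> dict:
        # I/O-bound: download the image referenced in the request.
        # Many requests can sit in this phase at the same time.
        image_bytes = requests.get(model_input["image_url"], timeout=30).content
        return {"image_bytes": image_bytes}

    def predict(self, model_input: dict) -> dict:
        # GPU-bound: at most predict_concurrency requests are here at once.
        return self._model(model_input["image_bytes"])

    def postprocess(self, model_output: dict) -> dict:
        # CPU-bound: format the response while other requests are in predict().
        return {"prediction": model_output["label"]}
```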
When to use predict concurrency > 1
- I/O-heavy preprocessing – If preprocessing involves network calls or disk I/O, higher predict_concurrency allows these to overlap with GPU inference.
- Batching engines – Engines like vLLM and BEI handle batching internally and benefit from higher concurrency.
Finding your concurrency target
The concurrency target is the most important autoscaling setting. It determines how hard each replica works and directly affects cost and latency.
The benchmarking process
To find the optimal concurrency target for your model (a load-test sketch follows this list):
- Deploy with high concurrency – Set a high concurrency target (e.g., 64 or 128) and min/max replicas to 1. This lets you probe a single replica’s capacity without autoscaling interfering.
- Use realistic traffic – Benchmark with your actual request distribution: real input sizes, output lengths, streaming vs non-streaming, tool calls if applicable. Synthetic “hello world” requests process too quickly and give misleading results.
- Measure latency at increasing load – Send traffic at increasing concurrency levels (1, 2, 4, 8, 16, 32…) and record p50, p95, and p99 latency at each level.
- Find the knee – Plot latency vs concurrency. Look for the “knee” where latency starts to degrade sharply. This is the point where queueing begins to dominate.
- Set concurrency at or below the knee – Your concurrency target should be at or slightly below this point to maintain acceptable latency under normal load.
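A minimal load-test sketch for the measurement step, assuming a standard HTTP predict endpoint. The URL, API key, payload, and request counts are placeholders to replace with your own deployment details and realistic traffic:

```python
import concurrent.futures
import statistics
import time

import requests

URL = "https://model-<id>.api.baseten.co/environments/production/predict"  # placeholder
HEADERS = {"Authorization": "Api-Key <your-api-key>"}                       # placeholder
PAYLOAD = {"prompt": "a realistic request drawn from production traffic"}   # placeholder


def timed_request(_: int) -> float:
    """Send one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    response = requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=120)
    response.raise_for_status()
    return time.perf_counter() - start


def run_level(concurrency: int, requests_per_worker: int = 8) -> None:
    """Hold `concurrency` requests in flight and report latency percentiles."""
    total = concurrency * requests_per_worker
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(total)))
    percentiles = statistics.quantiles(latencies, n=100)
    print(
        f"concurrency={concurrency:>3}  "
        f"p50={percentiles[49]:.2f}s  p95={percentiles[94]:.2f}s  p99={percentiles[98]:.2f}s"
    )


if __name__ == "__main__":
    for level in (1, 2, 4, 8, 16, 32, 64):
        run_level(level)
```

Plot the printed latencies against concurrency to spot the knee described below.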
The “knee of the curve”
At low concurrency, adding more concurrent requests has minimal impact on latency because the model has spare capacity. At some point, the model becomes saturated and latency increases sharply with each additional request. The concurrency target should be set just before this inflection point.
If you can’t benchmark
If you don’t have time to benchmark, use these starting points based on model type:
| Model type | Starting concurrency | Rationale |
|---|---|---|
| Standard Truss model | 1 | Conservative default |
| vLLM / LLM inference | 32–128 | Balances batching with latency |
| SGLang | 32 | Moderate throughput |
| Text embeddings (TEI) | 32 | Aligned with max-client-batch-size |
| BEI embeddings | 96+ (min ≥ 8) | High throughput, millisecond inference |
| Whisper (async batch) | 256 | Audio with batching |
| Image generation (SDXL) | 1 | VRAM intensive |
Computing min and max replicas
Once you know your concurrency target and throughput per replica, you can compute replica counts. For example (the sketch after this list runs the same numbers):
- Your model handles 10 req/s at concurrency 32
- Baseline traffic is 25 req/s, peak is 80 req/s
- min_replicas = ceil(25 / 10) + 1 = 4 (round up to cover baseline traffic, plus one replica of headroom)
- max_replicas = ceil(80 / 10) + 2 = 10 (cover peak traffic, plus a small buffer)
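The same capacity-planning arithmetic as a small sketch; the one-replica headroom and two-replica peak buffer mirror the example above and can be tuned to your own tolerance:

```python
import math


def replica_counts(baseline_rps: float, peak_rps: float, per_replica_rps: float,
                   headroom: int = 1, peak_buffer: int = 2) -> tuple[int, int]:
    """Return (min_replicas, max_replicas) from traffic and per-replica throughput."""
    min_replicas = math.ceil(baseline_rps / per_replica_rps) + headroom
    max_replicas = math.ceil(peak_rps / per_replica_rps) + peak_buffer
    return min_replicas, max_replicas


# 25 req/s baseline and 80 req/s peak at 10 req/s per replica.
print(replica_counts(baseline_rps=25, peak_rps=80, per_replica_rps=10))  # (4, 10)
```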