Configuring concurrency lets you balance throughput and latency to get the most out of your model. In Baseten and Truss, concurrency is managed at two levels:
  1. Concurrency target – Limits the number of requests sent to a single replica.
  2. Predict concurrency – Limits how many requests the predict function handles inside the model container.

Concurrency target

  • Set in the Baseten UI or API – Defines how many requests a single replica handles before autoscaling triggers.
  • Triggers autoscaling – When all replicas hit the concurrency target, additional replicas spin up.
Example:
  • Concurrency target = 2, single replica
  • 5 requests arrive → 2 are processed immediately, 3 are queued
  • If max replicas aren’t reached, autoscaling spins up a new replica
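To make the example concrete, here is a minimal illustrative Python sketch (not Baseten platform code; the one-second sleep stands in for inference time) of a single replica with a concurrency target of 2 receiving 5 requests:

import asyncio

CONCURRENCY_TARGET = 2  # max in-flight requests per replica
replica_slots = asyncio.Semaphore(CONCURRENCY_TARGET)

async def handle(request_id: int) -> None:
    # Requests beyond the target wait here, mirroring the platform-level queue.
    async with replica_slots:
        print(f"request {request_id}: processing")
        await asyncio.sleep(1)  # stand-in for inference time
    print(f"request {request_id}: done")

async def main() -> None:
    # 5 requests arrive at once: 2 run immediately, 3 queue behind them.
    await asyncio.gather(*(handle(i) for i in range(1, 6)))

asyncio.run(main())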
For full autoscaling configuration, see Autoscaling.

Predict concurrency

predict_concurrency is only available for Python Truss models. It is not configurable for custom servers or TensorRT-LLM Engine Builder deployments.
  • Set in config.yaml – Controls how many requests can be processed by predict simultaneously inside the container.
  • Protects GPU resources – Prevents multiple requests from overloading the GPU.

Configuring predict concurrency

config.yaml
model_name: "My model with concurrency limits"
runtime:
  predict_concurrency: 2  # Default is 1

How it works inside a model pod

  1. Requests arrive → All begin preprocessing (e.g., downloading images from S3).
  2. Predict runs on GPU → Limited by predict_concurrency.
  3. Postprocessing begins → Can run while other requests are still in inference.
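For orientation, here is a minimal model.py sketch following the standard Truss model class layout. The identity "model" is a stand-in for real weights and inference; the comments mark where predict_concurrency applies:

# model/model.py
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per replica at startup; normally loads weights onto the GPU.
        self._model = lambda x: x  # stand-in for a real model

    def preprocess(self, model_input):
        # I/O-bound work (e.g. downloading an image from S3) is not limited by
        # predict_concurrency, so it can overlap with other requests' inference.
        return model_input

    def predict(self, model_input):
        # GPU-bound inference: at most predict_concurrency requests run here at once.
        return self._model(model_input)

    def postprocess(self, model_output):
        # Runs outside the predict_concurrency limit, so it can overlap with
        # inference for other requests.
        return model_output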

When to use predict concurrency > 1

  • I/O-heavy preprocessing – If preprocessing involves network calls or disk I/O, higher predict_concurrency allows these to overlap with GPU inference.
  • Batching engines – Engines like vLLM and BEI handle batching internally and benefit from higher concurrency.
The concurrency target should be ≥ predict_concurrency to ensure enough requests reach the container. If concurrency target is lower, requests will queue at the platform level before the container can batch them.
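As a hedged illustration, a config.yaml for a Truss model that wraps an internally batching engine might look like the following. The model name and the value 32 are assumptions; match them to your engine, and set the platform-side concurrency target at least as high:

config.yaml
model_name: "My internally batching model"
runtime:
  predict_concurrency: 32  # let the engine see enough requests to batch
                           # keep the concurrency target >= this value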

Finding your concurrency target

The concurrency target is the most important autoscaling setting. It determines how hard each replica works and directly affects cost and latency.

The benchmarking process

To find the optimal concurrency target for your model:
  1. Deploy with high concurrency – Set a high concurrency target (e.g., 64 or 128) and min/max replicas to 1. This lets you probe a single replica’s capacity without autoscaling interfering.
  2. Use realistic traffic – Benchmark with your actual request distribution: real input sizes, output lengths, streaming vs non-streaming, tool calls if applicable. Synthetic “hello world” requests process too quickly and give misleading results.
  3. Measure latency at increasing load – Send traffic at increasing concurrency levels (1, 2, 4, 8, 16, 32…) and record p50, p95, and p99 latency at each level (see the benchmarking sketch after this list).
  4. Find the knee – Plot latency vs concurrency. Look for the “knee” where latency starts to degrade sharply. This is the point where queueing begins to dominate.
  5. Set concurrency at or below the knee – Your concurrency target should be at or slightly below this point to maintain acceptable latency under normal load.
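A minimal load-testing sketch along these lines, using the requests library plus the standard library. The endpoint URL, API key, and payload are placeholders to replace with your own model's details and realistic inputs:

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://model-xxxxxx.api.baseten.co/production/predict"  # placeholder endpoint
HEADERS = {"Authorization": "Api-Key YOUR_API_KEY"}              # placeholder key
PAYLOAD = {"prompt": "a realistic request from your workload"}   # use real inputs

def one_request() -> float:
    # Time a single round trip to the model.
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=300)
    return time.perf_counter() - start

for concurrency in (1, 2, 4, 8, 16, 32):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Send several batches at this concurrency level to get stable numbers.
        latencies = sorted(pool.map(lambda _: one_request(), range(concurrency * 5)))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"concurrency={concurrency}: p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")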

The “knee of the curve”

At low concurrency, adding more concurrent requests has minimal impact on latency because the model has spare capacity. At some point, the model becomes saturated and latency increases sharply with each additional request. The concurrency target should be set just before this inflection point.
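One simple way to turn those measurements into a number is to flag the first concurrency level where p95 latency jumps sharply over the previous one. The values and the 50% threshold below are illustrative assumptions, not a rule:

# (concurrency, p95 latency in seconds) pairs from a benchmark run -- example values
measurements = [(1, 0.80), (2, 0.85), (4, 0.90), (8, 1.05), (16, 1.90), (32, 4.20)]

knee = measurements[-1][0]  # fall back to the highest level tested
for (prev_c, prev_p95), (_, cur_p95) in zip(measurements, measurements[1:]):
    if cur_p95 > prev_p95 * 1.5:  # a 50% jump marks saturation (arbitrary threshold)
        knee = prev_c             # last level before latency degraded sharply
        break

print(f"suggested concurrency target: {knee}")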

If you can’t benchmark

If you don’t have time to benchmark, use these starting points based on model type:
Model type                 Starting concurrency   Rationale
Standard Truss model       1                      Conservative default
vLLM / LLM inference       32–128                 Balances batching with latency
SGLang                     32                     Moderate throughput
Text embeddings (TEI)      32                     Aligned with max-client-batch-size
BEI embeddings             96+ (min ≥ 8)          High throughput, millisecond inference
Whisper (async batch)      256                    Audio with batching
Image generation (SDXL)    1                      VRAM intensive
Start with conservative settings. It’s easier to raise concurrency later (to improve cost efficiency) than to debug latency issues caused by setting concurrency too high.

Computing min and max replicas

Once you know your concurrency target and throughput per replica, you can compute replica counts:
throughput_per_replica = requests_per_second at target concurrency

min_replicas = ceil(baseline_rps / throughput_per_replica) + 1  # +1 for redundancy
max_replicas = ceil(peak_rps / throughput_per_replica) + buffer  # buffer for headroom
Example:
  • Your model handles 10 req/s at concurrency 32
  • Baseline traffic is 25 req/s, peak is 80 req/s
  • min_replicas = ceil(25 / 10) + 1 = 3 + 1 = 4
  • max_replicas = ceil(80 / 10) + 2 = 8 + 2 = 10
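The same arithmetic as a short sketch, with math.ceil making the rounding explicit (numbers copied from the example above):

import math

throughput_per_replica = 10   # req/s one replica sustains at concurrency 32
baseline_rps, peak_rps = 25, 80
redundancy, buffer = 1, 2     # extra replicas for headroom

min_replicas = math.ceil(baseline_rps / throughput_per_replica) + redundancy  # 3 + 1 = 4
max_replicas = math.ceil(peak_rps / throughput_per_replica) + buffer          # 8 + 2 = 10
print(min_replicas, max_replicas)  # 4 10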
For pattern-specific guidance, see Traffic patterns.