Configuring concurrency is key to optimizing model performance, balancing throughput against latency. In Baseten and Truss, concurrency is managed at two levels:
  1. Concurrency Target – Limits the number of requests sent to a single replica.
  2. Predict Concurrency – Limits how many requests the predict function handles inside the model container.

1. Concurrency Target

  • Set in the Baseten UI – Defines how many requests a single replica can process at once.
  • Triggers autoscaling – If all replicas hit the concurrency target, additional replicas spin up.
Example:
  • Concurrency Target = 2, single replica
  • 5 requests arrive → 2 are processed immediately, 3 are queued.
  • If the maximum replica count hasn't been reached, autoscaling spins up a new replica (see the sketch below).
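To make the arithmetic concrete, here is a simplified sketch of the scaling decision. It illustrates the idea only and is not Baseten's actual autoscaler; the function name and the max_replicas parameter are hypothetical.

import math

def replicas_needed(in_flight_requests: int, concurrency_target: int,
                    max_replicas: int) -> int:
    # Scale out until no replica handles more than concurrency_target
    # requests at once, capped at the deployment's max replica count.
    desired = math.ceil(in_flight_requests / concurrency_target)
    return min(desired, max_replicas)

# 5 concurrent requests against a concurrency target of 2:
print(replicas_needed(5, 2, max_replicas=4))  # -> 3 replicas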

2. Predict Concurrency

  • Set in config.yaml – Controls how many requests the predict function can process simultaneously.
  • Protects GPU resources – Prevents multiple requests from overloading the GPU.

Configuring Predict Concurrency

config.yaml
model_name: "My model with concurrency limits"
runtime:
  predict_concurrency: 2  # Default is 1
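
For context, the predict_concurrency limit applies only to the predict method of the Truss model; preprocessing and postprocessing are not gated by it. A minimal model.py sketch (the method bodies are hypothetical placeholders):

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Load weights onto the GPU once per replica (placeholder here).
        self._model = lambda x: x

    def preprocess(self, model_input):
        # Not limited by predict_concurrency: I/O such as downloading
        # inputs from S3 can run for many requests at once.
        return model_input

    def predict(self, model_input):
        # Gated by predict_concurrency: at most 2 requests (per the
        # config above) execute here simultaneously.
        return self._model(model_input)

    def postprocess(self, model_output):
        # Also outside the limit, so it can overlap with other
        # requests' inference.
        return model_output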

How It Works Inside a Model Pod

  1. Requests arrive → All begin preprocessing (e.g., downloading images from S3).
  2. Predict runs on the GPU → Limited by predict_concurrency.
  3. Postprocessing begins → Can run while other requests are still in inference.
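
This flow can be pictured as a semaphore that is held only for the duration of predict. The following asyncio sketch is a simplified model of that behavior, not the actual Truss server code:

import asyncio

PREDICT_CONCURRENCY = 2  # mirrors runtime.predict_concurrency above

async def serve(request_id: int, predict_slots: asyncio.Semaphore) -> None:
    # 1. Preprocessing is not limited: every request starts it immediately.
    await asyncio.sleep(0.1)      # stand-in for I/O, e.g. downloading from S3
    # 2. Predict is gated: at most PREDICT_CONCURRENCY requests hold the GPU.
    async with predict_slots:
        print(f"request {request_id}: predict")
        await asyncio.sleep(0.5)  # stand-in for GPU inference
    # 3. Postprocessing overlaps with other requests' inference.
    await asyncio.sleep(0.1)      # stand-in for I/O, e.g. uploading results

async def main() -> None:
    predict_slots = asyncio.Semaphore(PREDICT_CONCURRENCY)
    await asyncio.gather(*(serve(i, predict_slots) for i in range(5)))

asyncio.run(main())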

When to Use Predict Concurrency

  • ✅ Protect GPU resources – Prevent multiple requests from degrading performance.
  • ✅ Allow parallel preprocessing/postprocessing – I/O tasks can continue even while inference is blocked.
Note: ensure the Concurrency Target is set high enough that the replica receives enough requests to benefit from predict concurrency.