BEI and Engine-Builder-LLM use dynamic batching to process multiple requests in parallel. This increases throughput but requires different autoscaling settings than standard models.
## Quick reference
| Setting | BEI | Engine-Builder-LLM |
|---|---|---|
| Target utilization | 25% | 40–50% |
| Concurrency target | 96+ (min ≥ 8) | 32–256 |
| Special considerations | Use Performance client for multi-payload routes | Never exceed max_batch_size |
For general autoscaling concepts, see Autoscaling.
## BEI
BEI provides millisecond-range inference times and therefore scales differently from other model types. With too few replicas, backpressure builds up quickly.
### Recommendations
| Setting | Value | Why |
|---|---|---|
| Target utilization | 25% | Low target provides headroom for traffic spikes |
| Concurrency target | 96+ (min ≥ 8) | High concurrency allows maximum throughput |
| Autoscaling | Enabled | Required for variable traffic |
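The effect of the low utilization target can be sketched numerically. Below is a minimal illustration that applies the recommended numbers to a generic request-based autoscaler; the formula is an assumption for illustration, not BEI's exact scaling algorithm:

```python
import math

def desired_replicas(in_flight_requests, concurrency_target, target_utilization):
    """Replicas needed so per-replica utilization stays at or below the target.

    Assumes utilization = in_flight / (replicas * concurrency_target)."""
    return max(1, math.ceil(in_flight_requests / (concurrency_target * target_utilization)))

# BEI recommendation: concurrency target 96, utilization target 25%.
# A burst of 240 in-flight requests:
replicas = desired_replicas(240, concurrency_target=96, target_utilization=0.25)
# 240 / (96 * 0.25) = 10 replicas, for 960 total concurrent slots --
# roughly 75% headroom before requests start to queue.
```

A higher utilization target would provision fewer replicas for the same burst, which is why the low 25% target is what gives BEI room to absorb spikes.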
### Multi-payload routes
The /rerank and /v1/embeddings routes accept multiple items per request, which undermines request-based autoscaling: each API call counts as a single request no matter how many items it contains, so request counts can understate the actual load on a replica.
Use the Performance client for optimal scaling with multi-payload routes.
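To see the accounting mismatch concretely, consider one batched embeddings call. This is an illustrative sketch (the payload shape mirrors a typical /v1/embeddings schema; the model name and the counting variables are hypothetical, not a client API):

```python
# One API call to /v1/embeddings carrying 64 texts:
payload = {
    "model": "my-embedding-model",  # hypothetical model name
    "input": [f"document {i}" for i in range(64)],
}

requests_counted = 1                    # what request-based autoscaling sees
items_of_work = len(payload["input"])   # what the replica actually processes

# The autoscaler sees a single request despite 64 items of work,
# understating load -- the reason the Performance client is
# recommended for these routes.
```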
## Engine-Builder-LLM
Engine-Builder-LLM uses dynamic batching like BEI, but its routes carry one payload per request, so it avoids the multi-payload accounting problem.
### Recommendations
| Setting | Value | Why |
|---|---|---|
| Target utilization | 40–50% | Accommodates dynamic batching behavior |
| Concurrency target | 32–256 | Match or stay below max_batch_size |
| Min concurrency | ≥ 8 | Optimal performance floor |
Never set concurrency target above max_batch_size. This causes on-replica queueing and negates the benefits of autoscaling. If your max_batch_size is 64, keep concurrency target at 64 or below.
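The cap can be enforced with a simple guard when configuring a deployment. A minimal sketch; the function name and signature are illustrative, not part of any deployment API:

```python
def safe_concurrency_target(requested, max_batch_size):
    """Clamp the autoscaling concurrency target to max_batch_size.

    Requests beyond max_batch_size cannot join the running batch and
    queue on the replica, hiding load from the autoscaler."""
    if requested > max_batch_size:
        return max_batch_size
    return requested

# With max_batch_size = 64, a requested target of 128 is clamped to 64,
# while a target of 48 passes through unchanged.
```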
### Lookahead decoding
With lookahead decoding enabled, set the concurrency target equal to or slightly below max_batch_size so that lookahead can perform its optimizations. Note that keeping the concurrency target at or below max_batch_size is good practice for every Engine-Builder-LLM deployment, not just those using lookahead.