BEI and Engine-Builder-LLM use dynamic batching to process multiple requests in parallel. This increases throughput but requires different autoscaling settings than standard models.

Quick reference

| Setting | BEI | Engine-Builder-LLM |
|---|---|---|
| Target utilization | 25% | 40–50% |
| Concurrency target | 96+ (min ≥ 8) | 32–256 |
| Special considerations | Use the Performance client for multi-payload routes | Never exceed `max_batch_size` |
For general autoscaling concepts, see Autoscaling.

BEI

BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly.

Recommendations

| Setting | Value | Why |
|---|---|---|
| Target utilization | 25% | Low target provides headroom for traffic spikes |
| Concurrency target | 96+ (min ≥ 8) | High concurrency allows maximum throughput |
| Autoscaling | Enabled | Required for variable traffic |
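
The interaction between target utilization and concurrency target can be sketched as back-of-envelope replica math. This is illustrative only; the function name and formula are assumptions for the sketch, not part of any platform API:

```python
import math

def replicas_needed(expected_in_flight: int, concurrency_target: int,
                    target_utilization: float) -> int:
    """Estimate replicas: the autoscaler adds capacity so that average
    in-flight requests per replica stay near
    concurrency_target * target_utilization."""
    per_replica = concurrency_target * target_utilization
    return math.ceil(expected_in_flight / per_replica)

# With the recommended BEI settings (25% of a 96-request target), each
# replica is scaled to carry ~24 in-flight requests on average, leaving
# ~72 requests of per-replica headroom to absorb spikes.
print(replicas_needed(expected_in_flight=240,
                      concurrency_target=96,
                      target_utilization=0.25))  # -> 10
```

The low utilization target is what buys the headroom: the same 240 in-flight requests at a 50% target would scale to only 5 replicas, with half the spare capacity per replica.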

Multi-payload routes

The /rerank and /v1/embeddings routes can send multiple items per request, which complicates request-based autoscaling: each API call counts as one request regardless of how many items it contains, so the autoscaler underestimates per-replica load. Use the Performance client for optimal scaling with multi-payload routes.
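
A minimal sketch of the undercounting problem (the request shapes below are hypothetical examples, not a real payload format):

```python
# Request-based autoscaling sees one in-flight request per API call,
# even when a single call carries hundreds of items of real work.
requests = [
    {"route": "/v1/embeddings", "items": 512},  # one call, 512 texts
    {"route": "/rerank", "items": 64},          # one call, 64 documents
]

in_flight_requests = len(requests)                # what the autoscaler counts
actual_items = sum(r["items"] for r in requests)  # actual work on the replica
print(in_flight_requests, actual_items)  # -> 2 576
```

Two "requests" here hide 576 items of batched work, which is why a client that splits and routes large payloads scales more predictably than raw multi-item calls.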

Engine-Builder-LLM

Engine-Builder-LLM uses dynamic batching similar to BEI but doesn’t face the multi-payload challenge.

Recommendations

| Setting | Value | Why |
|---|---|---|
| Target utilization | 40–50% | Accommodates dynamic batching behavior |
| Concurrency target | 32–256 | Match or stay below `max_batch_size` |
| Min concurrency | ≥ 8 | Optimal performance floor |
Never set the concurrency target above `max_batch_size`: excess requests queue on the replica instead of triggering scale-up, which negates the benefit of autoscaling. If your `max_batch_size` is 64, keep the concurrency target at 64 or below.
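
The rule above amounts to a simple clamp. This helper is a hypothetical sketch for validating your own configuration, not part of any SDK:

```python
def safe_concurrency_target(desired: int, max_batch_size: int) -> int:
    """Clamp the concurrency target to max_batch_size: any request beyond
    the batch size waits in an on-replica queue rather than counting
    toward scale-up, so targets above it are never useful."""
    return min(desired, max_batch_size)

print(safe_concurrency_target(128, 64))  # -> 64 (clamped)
print(safe_concurrency_target(32, 64))   # -> 32 (already safe)
```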

Lookahead decoding

If you use lookahead decoding, set the concurrency target equal to or slightly below `max_batch_size` so lookahead can apply its optimizations. The same ceiling applies to all Engine-Builder-LLM deployments, with or without lookahead.