Autoscaling engines

BEI and Engine-Builder-LLM use dynamic batching to process multiple requests in parallel. This increases throughput but requires different autoscaling settings than standard models.

Quick reference

Setting	BEI	Engine-Builder-LLM
Target utilization	25%	40–50%
Concurrency target	96+ (min ≥ 8)	32–256
Special considerations	Use Performance client for multi-payload routes	Never exceed max_batch_size

For general autoscaling concepts, see Autoscaling.

BEI

BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly.

Recommendations

Setting	Value	Why
Target utilization	25%	Low target provides headroom for traffic spikes
Concurrency target	96+ (min ≥ 8)	High concurrency allows maximum throughput
Autoscaling	Enabled	Required for variable traffic
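
How the two settings interact can be sketched with a little arithmetic. The sketch below assumes that target utilization is applied as a fraction of the concurrency target, so scale-up begins once a replica carries that share of its target. This interpretation is an assumption and is not stated on this page; see the Autoscaling reference for the exact behavior. The function names are hypothetical.

```python
import math


def scale_up_threshold(concurrency_target: int, target_utilization: float) -> float:
    """In-flight requests per replica at which scale-up would begin.

    Assumption: utilization is a fraction of the concurrency target.
    """
    return concurrency_target * target_utilization


def replicas_needed(in_flight: int, concurrency_target: int,
                    target_utilization: float) -> int:
    """Replicas required to keep each replica at or below the threshold."""
    per_replica = scale_up_threshold(concurrency_target, target_utilization)
    return max(1, math.ceil(in_flight / per_replica))


# BEI recommendation from the table: concurrency target 96, utilization 25%.
print(scale_up_threshold(96, 0.25))    # 24.0
print(replicas_needed(100, 96, 0.25))  # 5
```

Under this assumption, a low utilization target means each replica is kept well below its concurrency target, which is where the headroom for spikes comes from.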

Multi-payload routes

The /rerank and /v1/embeddings routes can send multiple items per request, which complicates request-based autoscaling: each API call counts as one request regardless of how many items it contains, so the request count can understate the actual load. Use the Performance client for optimal scaling with multi-payload routes.
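
The gap between request count and actual work can be illustrated with a short calculation. This is only a sketch: the function name and the per-request item counts are made up for illustration, and the autoscaler is assumed to see only the request count.

```python
def items_in_flight(concurrent_requests: int, items_per_request: int) -> int:
    """Total items a replica is processing for a given number of requests."""
    return concurrent_requests * items_per_request


# Both cases look identical to a request-based autoscaler (96 requests),
# but the second carries 64 times more work.
single_item = items_in_flight(96, 1)
multi_item = items_in_flight(96, 64)

print(single_item, multi_item)
```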

Engine-Builder-LLM

Engine-Builder-LLM uses dynamic batching similar to BEI but doesn’t face the multi-payload challenge.

Recommendations

Setting	Value	Why
Target utilization	40–50%	Accommodates dynamic batching behavior
Concurrency target	32–256	Match or stay below max_batch_size
Min concurrency	≥ 8	Optimal performance floor

Never set the concurrency target above max_batch_size. Doing so causes requests to queue on the replica and negates the benefits of autoscaling. For example, if your max_batch_size is 64, keep the concurrency target at 64 or below.
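
A pre-deployment check for this rule might look like the following sketch. The function name and messages are hypothetical, and the default floor of 8 is taken from the recommendations table above; nothing here calls a real API.

```python
from typing import List


def check_concurrency_target(concurrency_target: int, max_batch_size: int,
                             min_concurrency: int = 8) -> List[str]:
    """Return a list of problems with the chosen concurrency target.

    An empty list means the value satisfies both rules on this page.
    """
    problems = []
    if concurrency_target > max_batch_size:
        problems.append(
            f"concurrency target {concurrency_target} exceeds "
            f"max_batch_size {max_batch_size}: expect on-replica queueing"
        )
    if concurrency_target < min_concurrency:
        problems.append(
            f"concurrency target {concurrency_target} is below the "
            f"recommended floor of {min_concurrency}"
        )
    return problems


print(check_concurrency_target(64, 64))   # []
print(check_concurrency_target(128, 64))  # one problem reported
```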

Lookahead decoding

If you use lookahead decoding, set the concurrency target equal to or slightly below max_batch_size. This gives lookahead room to perform its optimizations. The same guidance applies to all Engine-Builder-LLM deployments, not just those using lookahead.

Related

Autoscaling: Full parameter reference.
Traffic patterns: Pattern-specific settings.
BEI overview: General BEI documentation.
Engine-Builder-LLM overview: Generation model details.
Performance client: Client usage for batch processing.
