Beyond the Introduction to autoscaling, some adjustments specialized to models using dynamic batching are helpful. Both BEI and Engine-Builder-LLM use dynamic batching to process multiple requests in parallel. This increases throughput at the cost of higher p50 latency.
Combined with engine-specific autoscaling settings, dynamic batching becomes a powerful tool for maintaining optimal performance across varying traffic patterns.
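To make the throughput/latency trade-off concrete, here is a minimal, illustrative sketch of dynamic batching in Python. It is not the actual BEI or Engine-Builder-LLM implementation; the batch size, window, and `run_model` placeholder are assumptions for illustration only. Requests arriving within a short window are grouped into one forward pass, so throughput rises while each request's latency now includes time spent waiting for the batch to fill.

```python
import asyncio
import time

MAX_BATCH_SIZE = 8
WINDOW_S = 0.005  # 5 ms batching window (illustrative value)


def run_model(payloads):
    # Placeholder for a single batched forward pass over all payloads.
    return [f"result({p})" for p in payloads]


async def batch_worker(queue: asyncio.Queue) -> None:
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + WINDOW_S
        # Keep collecting requests until the window closes or the batch is full.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = run_model([payload for payload, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)


async def infer(queue: asyncio.Queue, payload: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    # p50 latency now includes time spent waiting for the batch to fill.
    return await fut


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    print(await asyncio.gather(*(infer(queue, f"req-{i}") for i in range(16))))
    worker.cancel()


asyncio.run(main())
```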
BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly.

Key recommendations:
- **Enable autoscaling** - BEI's millisecond-range inference and dynamic batching require autoscaling to handle variable traffic efficiently.
- **Target utilization: 25%** - A low target provides headroom for traffic spikes and accommodates dynamic batching behavior (see the sketch after this list).
- **Concurrency: 96+ requests** - High concurrency allows maximum throughput (default: 256).
- **Minimum concurrency: ≥8** - Never set concurrency below 8 for optimal performance.
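The effect of a low target utilization can be seen with a toy replica-count model. This is an illustrative approximation only, not Baseten's actual autoscaling algorithm; the formula and variable names are assumptions.

```python
import math

# Toy model of concurrency-based scale-out (illustrative, not the real
# autoscaler): each replica is treated as "full" at 25% of its
# concurrency target, leaving headroom for spikes and dynamic batching.
concurrency_target = 96    # concurrent requests one replica accepts
target_utilization = 0.25  # recommended 25% target for BEI

in_flight = 120  # current concurrent requests across the deployment
desired_replicas = math.ceil(
    in_flight / (concurrency_target * target_utilization)
)
print(desired_replicas)  # 120 / (96 * 0.25) = 5 replicas
```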
Multi-payload routes (`/rerank`, `/v1/embeddings`) accept many payloads in a single request, which challenges autoscaling based on concurrent requests: one large request looks like a single unit of load. Use the Performance client for optimal scaling.
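The sketch below illustrates the underlying idea with a generic asyncio/httpx client, not the Performance client's actual API: splitting one large embeddings batch into many smaller parallel requests gives the autoscaler an accurate concurrency signal. The URL, auth header, chunk size, and response shape are placeholders.

```python
import asyncio

import httpx

EMBEDDINGS_URL = "https://<your-deployment>/v1/embeddings"  # placeholder
CHUNK = 16  # illustrative chunk size


async def embed_all(texts: list[str], api_key: str) -> list:
    """Embed a large batch as many small concurrent requests."""
    headers = {"Authorization": f"Api-Key {api_key}"}
    async with httpx.AsyncClient(headers=headers, timeout=60) as client:

        async def embed_chunk(chunk: list[str]):
            resp = await client.post(
                EMBEDDINGS_URL,
                json={"input": chunk, "model": "embedding-model"},
            )
            resp.raise_for_status()
            return resp.json()["data"]  # assumed OpenAI-style response shape

        chunks = [texts[i : i + CHUNK] for i in range(0, len(texts), CHUNK)]
        results = await asyncio.gather(*(embed_chunk(c) for c in chunks))
    # Flatten per-chunk results back into one list, in order.
    return [item for batch in results for item in batch]
```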
Engine-Builder-LLM uses dynamic batching to maximize throughput, similar to BEI, but doesn't face the multi-payload challenge that BEI does with the `/rerank` and `/v1/embeddings` routes.

Key recommendations:
- **Target utilization: 40-50%** - Lower than the default to accommodate dynamic batching and provide headroom.
- **Concurrency: 32-256 requests** - The default of 256 works well for most workloads.
- **Batch cases** - Use the Performance client for batch processing (a generic stand-in is sketched after this list).
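For batch workloads, the key client-side behavior is capping in-flight requests so they line up with the deployment's concurrency setting. The sketch below is a generic asyncio/httpx stand-in, not the Performance client itself; the endpoint, payload shape, and limit are illustrative assumptions.

```python
import asyncio

import httpx

PREDICT_URL = "https://<your-deployment>/predict"  # placeholder
MAX_IN_FLIGHT = 32  # keep at or below the deployment's concurrency setting


async def run_batch(prompts: list[str], api_key: str) -> list[str]:
    """Send a batch of prompts with bounded client-side concurrency."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    headers = {"Authorization": f"Api-Key {api_key}"}
    async with httpx.AsyncClient(headers=headers, timeout=120) as client:

        async def one(prompt: str) -> str:
            async with sem:  # cap in-flight requests
                resp = await client.post(PREDICT_URL, json={"prompt": prompt})
                resp.raise_for_status()
                return resp.json()["output"]  # assumed response shape

        return await asyncio.gather(*(one(p) for p in prompts))
```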
**Important:** Do not set concurrency above `max_batch_size`, as this leads to on-replica queueing and negates the benefits of autoscaling.
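A deployment-time sanity check makes this rule hard to violate. The variable names below are illustrative, not actual configuration keys:

```python
# Guard against on-replica queueing: requests admitted beyond what one
# batch can hold just wait on the replica instead of triggering scale-out.
max_batch_size = 256  # engine build setting
concurrency = 256     # deployment concurrency setting

assert concurrency <= max_batch_size, (
    "concurrency above max_batch_size causes on-replica queueing"
)
```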