
Auto-Scaling Engines

Beyond the Introduction to autoscaling, some adjustments specialized to models using dynamic batching are helpful. Both BEI and Engine-Builder-LLM use dynamic batching to process multiple requests in parallel. The resulting throughput gain comes at the cost of higher p50 latency. Combined with engine-specific autoscaling settings, dynamic batching becomes a powerful tool for maintaining performance across varying traffic patterns.
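The throughput/latency trade-off can be illustrated with a toy model (timings here are made up, not measured from either engine): a batch of N requests finishes in a fixed base cost plus a small per-request cost, so each request waits longer while aggregate throughput climbs.

```python
def batch_latency(batch_size: int, base_ms: float = 10.0, per_req_ms: float = 1.0) -> float:
    """Toy model: a batch of N requests completes in base cost + N * per-request cost."""
    return base_ms + batch_size * per_req_ms

def throughput_rps(batch_size: int) -> float:
    """Requests completed per second when batches of this size run back to back."""
    return batch_size / (batch_latency(batch_size) / 1000.0)

for b in (1, 8, 32):
    print(f"batch={b:>2}  p50≈{batch_latency(b):5.1f} ms  throughput≈{throughput_rps(b):7.1f} req/s")
```

Larger batches raise per-request latency linearly but raise throughput much faster, which is why both engines batch aggressively and why autoscaling targets need extra headroom.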

BEI

BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly. Key recommendations:
  • Enable autoscaling - BEI’s millisecond-range inference and dynamic batching require autoscaling to handle variable traffic efficiently
  • Target utilization: 25% - Low target provides headroom for traffic spikes and accommodates dynamic batching behavior
  • Concurrency: 96+ requests - High concurrency allows maximum throughput (default: 256)
  • Minimum concurrency: ≥8 - Settings below 8 degrade performance; never go lower
Multi-payload routes (/rerank, /v1/embeddings) bundle many inputs into a single request, which skews autoscaling that counts concurrent requests. Use the Performance client for optimal scaling.
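A minimal sketch of why multi-payload routes challenge concurrency-based autoscaling (the request sizes below are hypothetical): the autoscaler sees one in-flight request per HTTP call, while the replica actually processes every item inside each payload.

```python
def observed_concurrency(payload_sizes: list[int]) -> int:
    """What a concurrency-based autoscaler sees: one unit per HTTP request."""
    return len(payload_sizes)

def actual_items(payload_sizes: list[int]) -> int:
    """What the replica actually has to embed or rerank: every item in every payload."""
    return sum(payload_sizes)

# Hypothetical traffic: three /v1/embeddings calls carrying 1, 64, and 128 texts.
calls = [1, 64, 128]
print(observed_concurrency(calls), "in-flight requests vs", actual_items(calls), "items")
```

Because the signal undercounts real load by a large factor, a low target utilization (and a client that spreads work across many smaller requests, like the Performance client) keeps scaling responsive.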

Engine-Builder-LLM

Engine-Builder-LLM uses dynamic batching to maximize throughput, similar to BEI, but doesn’t face the multi-payload challenge that BEI does with /rerank and /v1/embeddings routes. Key recommendations:
  • Target utilization: 40-50% - Lower than default to accommodate dynamic batching and provide headroom
  • Concurrency: 32-256 requests - Default 256 works well for most workloads
  • Batch cases - Use the Performance client for batch processing
Important: Do not set concurrency above max_batch_size as it leads to on-replica queueing and negates the benefits of autoscaling.
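The constraints above can be folded into a small pre-deployment sanity check. This is an illustrative helper, not part of any SDK; the parameter names are assumptions mirroring the settings discussed in this section.

```python
def check_autoscaling(concurrency: int, max_batch_size: int, target_utilization: float) -> list[str]:
    """Check hypothetical Engine-Builder-LLM settings against the guidance above."""
    warnings = []
    if concurrency > max_batch_size:
        warnings.append("concurrency exceeds max_batch_size: requests will queue on-replica")
    if not 0.40 <= target_utilization <= 0.50:
        warnings.append("target utilization outside the recommended 40-50% range")
    return warnings

# concurrency 512 against a max_batch_size of 256 triggers both warnings here
print(check_autoscaling(concurrency=512, max_batch_size=256, target_utilization=0.60))
```

An empty list means the configuration is consistent with the recommendations; anything else is worth fixing before deploying.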

Quick Reference

Setting              BEI             Engine-Builder-LLM
Target utilization   25%             40-50%
Concurrency          96+ (min ≥8)    32-256
Batch size           Flexible        Flexible

Further reading