# Autoscaling engines

> Engine-specific autoscaling settings for BEI and Engine-Builder-LLM

BEI and Engine-Builder-LLM use **dynamic batching** to process multiple requests in parallel. This increases throughput but requires different autoscaling settings than standard models.

## Quick reference

| Setting                    | BEI                                             | Engine-Builder-LLM            |
| -------------------------- | ----------------------------------------------- | ----------------------------- |
| **Target utilization**     | 25%                                             | 40–50%                        |
| **Concurrency target**     | 96+ (min ≥ 8)                                   | 32–256                        |
| **Special considerations** | Use Performance client for multi-payload routes | Never exceed max\_batch\_size |

For general autoscaling concepts, see [Autoscaling](/deployment/autoscaling/overview).

***

## BEI

BEI provides millisecond-range inference times and scales differently than other models. With too few replicas, backpressure can build up quickly.

### Recommendations

| Setting            | Value             | Why                                             |
| ------------------ | ----------------- | ----------------------------------------------- |
| Target utilization | **25%**           | Low target provides headroom for traffic spikes |
| Concurrency target | **96+** (min ≥ 8) | High concurrency allows maximum throughput      |
| Autoscaling        | **Enabled**       | Required for variable traffic                   |
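As a rough illustration, the recommendations in the table can be captured in a small helper that builds a settings dictionary and rejects a concurrency target below the stated minimum. This is a sketch only: the function name `bei_autoscaling_settings` and the dictionary keys are illustrative assumptions, not confirmed API fields, so map them to whatever your deployment tooling expects.

```python
def bei_autoscaling_settings(concurrency_target: int = 96) -> dict:
    """Build a settings dict that follows the BEI recommendations.

    The key names below are hypothetical placeholders, not confirmed API fields.
    """
    # The table gives 8 as the minimum concurrency target for BEI.
    if concurrency_target < 8:
        raise ValueError("BEI concurrency target should be at least 8")
    return {
        "target_utilization_percentage": 25,  # low target leaves headroom for spikes
        "concurrency_target": concurrency_target,  # 96 or higher is recommended
    }


settings = bei_autoscaling_settings()
print(settings)
```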

### Multi-payload routes

The `/rerank` and `/v1/embeddings` routes can send multiple items per request, which complicates request-based autoscaling: each API call counts as one request regardless of how many items it contains, so the request count can understate the actual load on a replica.
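The arithmetic below shows why this matters. It is a simplified model, and the function name `items_in_flight` and the example numbers are illustrative assumptions: two workloads with the same number of concurrent requests look identical to a request-based autoscaler even though one carries far more items.

```python
def items_in_flight(concurrent_requests: int, items_per_request: int) -> int:
    """Total items a replica is processing, given equally sized requests."""
    return concurrent_requests * items_per_request


# Both workloads register as 96 concurrent requests.
light = items_in_flight(concurrent_requests=96, items_per_request=1)
heavy = items_in_flight(concurrent_requests=96, items_per_request=64)

print(light)  # 96 items
print(heavy)  # 6144 items
```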

Use the [Performance client](/inference/performance-client) for optimal scaling with multi-payload routes.

***

## Engine-Builder-LLM

Engine-Builder-LLM uses dynamic batching similar to BEI but doesn't face the multi-payload challenge.

### Recommendations

| Setting            | Value      | Why                                    |
| ------------------ | ---------- | -------------------------------------- |
| Target utilization | **40–50%** | Accommodates dynamic batching behavior |
| Concurrency target | **32–256** | Match or stay below max\_batch\_size   |
| Min concurrency    | **≥ 8**    | Optimal performance floor              |

<Warning>
  **Never set the concurrency target above `max_batch_size`.** Doing so causes on-replica queueing and negates the benefits of autoscaling. If your `max_batch_size` is 64, keep the concurrency target at 64 or below.
</Warning>
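To make the warning concrete, here is a minimal sketch of the queueing effect. It assumes each in-flight request occupies exactly one batch slot, which is a simplification, and the function name `on_replica_queue_depth` is hypothetical.

```python
def on_replica_queue_depth(concurrency_target: int, max_batch_size: int) -> int:
    """Requests left waiting on a replica that is at its concurrency target.

    Simplified model: one request takes one batch slot.
    """
    return max(0, concurrency_target - max_batch_size)


print(on_replica_queue_depth(64, 64))  # 0: every request fits in the batch
print(on_replica_queue_depth(96, 64))  # 32: these requests queue on the replica
```

Under this model, any value above zero means requests wait on a replica that the autoscaler still treats as within its target, so no new replica is added to absorb them.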

### Lookahead decoding

If you use lookahead decoding, set the concurrency target equal to or slightly below `max_batch_size` so that lookahead has room to perform its optimizations. The same ceiling applies to all Engine-Builder-LLM deployments, not only those using lookahead.

***

## Related

* [Autoscaling](/deployment/autoscaling/overview): Full parameter reference.
* [Traffic patterns](/deployment/autoscaling/traffic-patterns): Pattern-specific settings.
* [BEI overview](/engines/bei/overview): General BEI documentation.
* [Engine-Builder-LLM overview](/engines/engine-builder-llm/overview): Generation model details.
* [Performance client](/inference/performance-client): Client usage for batch processing.
