POST /v1/llm_models) under the llm_config block, not through Truss config.yaml. To enable any of them on your deployment, contact your Baseten representative.
KV-aware routing
Long prompts repeat context across requests. Without cache-aware routing, each worker rebuilds KV state from scratch on every request, even when another worker in the deployment already has the prefix cached. The KV-aware router maintains a real-time index of every worker’s KV cache contents and picks the worker most likely to serve a request from cache.
How it works
The router runs as a stateful service in front of the BIS-LLM worker pool. For each incoming request:
- The frontend tokenizes the prompt and calls the router for a worker assignment.
- The router scores each worker against the prompt’s tokens using a radix tree that indexes every worker’s KV cache.
- The router returns the worker most likely to serve the request from cache, balanced against current worker load.
- The frontend sends the request directly to that worker.
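The scoring step above can be sketched in a few lines. This is a minimal illustration, not the router's actual implementation: it replaces the radix tree with a direct longest-common-prefix scan, and the names Worker, pick_worker, and load_penalty are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Token sequences this worker is assumed to hold in its KV cache.
    cached_prefixes: list = field(default_factory=list)
    in_flight_tokens: int = 0  # stand-in for current load (assumption)


def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, prompt_tokens, load_penalty=0.001):
    """Return the worker with the best cache-overlap score, discounted by load."""
    def score(w):
        best = max(
            (shared_prefix_len(p, prompt_tokens) for p in w.cached_prefixes),
            default=0,
        )
        hit_fraction = best / max(len(prompt_tokens), 1)
        # Hypothetical trade-off: subtract a penalty proportional to load.
        return hit_fraction - load_penalty * w.in_flight_tokens
    return max(workers, key=score)


workers = [
    Worker("a", [(1, 2, 3, 4)], in_flight_tokens=100),
    Worker("b", [(1, 2)], in_flight_tokens=0),
]
chosen = pick_worker(workers, (1, 2, 3, 4, 5))
```

A radix tree gives the same answer as the scan here but shares common prefixes between entries, so lookup cost grows with prompt length rather than with the number of cached sequences.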
Configuration
Settings live under b10_routing_config. Defaults match production Model APIs and rarely need to change.
How queued requests are ordered when all workers are saturated. Queueing rarely triggers under normal load.
- fcfs: First-come, first-served with priority bumps. Optimizes tail TTFT and provides fairness.
- wspt: Weighted shortest processing time. Prioritizes cheaper requests (high cache hit, short prompts). Risks starving costly requests; use when average TTFT matters more than tail TTFT.
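To make the difference between the two policies concrete, here is a rough sketch. It assumes a request's cost is its count of uncached prompt tokens and that each request carries a weight; both the cost model and the QueuedRequest fields are illustrative assumptions, and the priority bumps in fcfs are left out.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    request_id: str
    arrival_order: int          # lower means it arrived earlier
    prompt_tokens: int
    estimated_hit_rate: float   # fraction of the prompt expected to be cached
    weight: float = 1.0         # hypothetical priority weight


def fcfs_order(queue):
    """First-come, first-served: sort by arrival only (priority bumps omitted)."""
    return sorted(queue, key=lambda r: r.arrival_order)


def wspt_order(queue):
    """Weighted shortest processing time: smallest cost-to-weight ratio first."""
    def ratio(r):
        # Assumed cost model: only uncached tokens need prefill work.
        uncached_tokens = r.prompt_tokens * (1.0 - r.estimated_hit_rate)
        return uncached_tokens / r.weight
    return sorted(queue, key=ratio)


queue = [
    QueuedRequest("long-cold", 0, 8000, 0.0),
    QueuedRequest("long-warm", 1, 8000, 0.9),
    QueuedRequest("short-cold", 2, 500, 0.0),
]
fcfs_ids = [r.request_id for r in fcfs_order(queue)]
wspt_ids = [r.request_id for r in wspt_order(queue)]
```

Under wspt the long, uncached request moves to the back of the queue even though it arrived first, which is the starvation risk noted above.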
Bias toward cache hits versus load balance. Higher values bias toward cache hits at the cost of balance; lower values bias toward balance at the cost of hits. Contact us before changing in production.
Randomness in worker selection. Higher values spread load across more workers; lower values concentrate hits on fewer workers. Contact us before changing in production.
When to use
KV-aware routing is on by default for BIS-LLM deployments and pays off whenever prompts share prefixes: agent loops, chat with long system messages, RAG pipelines reusing retrieved context, and code completion. Workloads with no prefix overlap (unique single-turn prompts) see only the load-balancing benefit.
Monitoring
| Metric | What it measures | What to look for |
|---|---|---|
| kv_cache_hit_rate | Actual KV cache hit rate observed by workers. | Baseline varies by model and traffic. Track changes over time, not absolute values. |
| kv_cache_hit_rate_skew | Router’s estimated hit rate minus actual hit rate. | Typically slightly positive (~+10%). Large positive: high cache churn. Large negative: missed event stream. |
| kv_cache_best_prefix_hit_rate | Best hit rate the router could have selected given its index. | Upper bound of routing quality for the current index. |
| kv_cache_hit_rate_efficiency | Ratio of actual hit rate to best possible. | Typically 90-100%. Lower values mean the router is trading hits for balance. |
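The two derived metrics follow directly from the definitions in the table. The sketch below assumes all rates are fractions between 0 and 1 and that skew is an absolute difference; the function names are illustrative, not part of any Baseten API.

```python
def hit_rate_skew(estimated_hit_rate, actual_hit_rate):
    """Router's estimated hit rate minus the hit rate workers observed."""
    return estimated_hit_rate - actual_hit_rate


def hit_rate_efficiency(actual_hit_rate, best_prefix_hit_rate):
    """Actual hit rate as a fraction of the best the router could have picked."""
    if best_prefix_hit_rate == 0:
        return None  # undefined when no cached prefix was available
    return actual_hit_rate / best_prefix_hit_rate


# Example values chosen for illustration only.
skew = hit_rate_skew(estimated_hit_rate=0.70, actual_hit_rate=0.60)
efficiency = hit_rate_efficiency(actual_hit_rate=0.60, best_prefix_hit_rate=0.64)
```

With these example numbers the skew is about +0.10 and the efficiency is about 0.94, both inside the typical ranges listed in the table.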
Disaggregated serving
In a standard deployment, each replica handles both prefill (prompt processing) and decode (token generation). When a long prompt arrives, the replica must finish prefill before it can decode any tokens, blocking shorter requests queued behind it.
How it works
Disaggregated serving splits prefill and decode into separate replica groups:
- Prefill replicas process input prompts and transfer the resulting KV cache to decode replicas.
- Decode replicas receive KV cache from prefill replicas and generate output tokens.
Configuration
Set is_disaggregated and b10_disagg_config in the llm_config block:
Enables disaggregated serving. Must be true for b10_disagg_config to take effect. Setting b10_disagg_config without is_disaggregated: true fails validation.
Prefill worker pods per replication unit. Must be an integer >= 1.
Decode worker pods per replication unit. Must be an integer >= 1.
A prefill: 1, decode: 2 configuration means each unit has one prefill pod and two decode pods. The autoscaler scales the number of units, not individual pods.
The backend rejects deployments where is_disaggregated is false or absent but b10_disagg_config is set, and rejects deployments where is_disaggregated is true but either worker count is missing or less than one.
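The validation rules and the unit arithmetic above can be expressed as a short client-side check. This is a sketch only: is_disaggregated and b10_disagg_config come from the text, but the worker-count keys prefill and decode are hypothetical names borrowed from the shorthand above, and the backend's real error messages are not known.

```python
def validate_disagg(llm_config):
    """Return a list of validation errors; an empty list means the config passes."""
    errors = []
    enabled = llm_config.get("is_disaggregated", False) is True
    disagg = llm_config.get("b10_disagg_config")

    if not enabled:
        if disagg is not None:
            errors.append("b10_disagg_config requires is_disaggregated: true")
        return errors

    disagg = disagg or {}
    for key in ("prefill", "decode"):  # hypothetical field names
        count = disagg.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(count, int) or isinstance(count, bool) or count < 1:
            errors.append(f"{key} worker count must be an integer >= 1")
    return errors


def total_pods(units, prefill, decode):
    """Pods in a deployment when the autoscaler runs `units` replication units."""
    return units * (prefill + decode)


ok = validate_disagg(
    {"is_disaggregated": True, "b10_disagg_config": {"prefill": 1, "decode": 2}}
)
pods_at_three_units = total_pods(units=3, prefill=1, decode=2)
```

Because the autoscaler adds whole units, scaling a prefill: 1, decode: 2 deployment from one unit to three moves it from 3 pods to 9.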
When to use
Disaggregated serving fits deployments with at least one of these traits:
- Mismatched prefill and decode resource profiles. Long-context models (128K+ tokens) have compute-heavy prefills and memory-bound decodes. Separate scaling right-sizes each phase.
- Strict TTFT targets. Isolating prefill on dedicated replicas prevents decode requests from queuing behind long prompts.
- Variable prompt lengths. Mixed short/long workloads benefit more than uniform traffic.
Monitoring
Watch BIS-LLM autoscaling metrics on each replica group. Token-based autoscaling sizes prefill and decode independently using their own in-flight token counts.
Speculative decoding
Speculative decoding accelerates inference by drafting several future tokens cheaply, then verifying them against the main model in a single forward pass. Accepted tokens advance the output; rejected tokens are discarded and the model resumes autoregressive decoding from the last accepted token.
How it works
BIS-LLM speculative decoding uses a fast draft mechanism (a lightweight Eagle head, the model’s own MTP layers, or n-gram automata) to generate candidate tokens. The main model then verifies these candidates in a single batched forward pass. Higher acceptance rates yield more tokens per forward pass and lower latency. This is a different system from v1 lookahead decoding, which uses n-gram patterns within a single model and is configured via trt_llm.build.speculator. The v2 stack rejects trt_llm.build.speculator; use speculative_config instead.
| Decoding type | How it works | Best for |
|---|---|---|
| Eagle | Separate Eagle head drafts tokens from a hidden-state representation. | Models with trained Eagle checkpoints. |
| MTP | The model’s own multi-token-prediction layers draft multiple tokens per step. | Models with MTP heads built in (DeepSeek-V3). |
| NGram | N-gram automata predict tokens from pattern matching without model computation. | High-throughput workloads where latency matters more than acceptance rate. |
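Whichever draft mechanism is used, the draft-then-verify loop has the same shape. The toy sketch below uses greedy acceptance with two stand-in functions for the draft and the main model; it is not BIS-LLM code, it verifies position by position where a real engine uses one batched forward pass, and keeping the main model's token at the first mismatch is an assumption about how decoding resumes.

```python
def speculative_step(context, draft_next, target_next, max_draft_len):
    """One draft-then-verify step with greedy acceptance.

    draft_next and target_next map a token list to the next token.
    Returns the tokens appended to the output in this step.
    """
    # 1. Draft: propose tokens cheaply, one after another.
    draft = []
    for _ in range(max_draft_len):
        draft.append(draft_next(context + draft))

    # 2. Verify: compare each drafted token with the main model's choice.
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            # First mismatch: discard the rest and keep the main model's token.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    return accepted


def target_next(tokens):
    """Toy main model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft: agrees with the main model except right after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


partial = speculative_step([1], draft_next, target_next, max_draft_len=4)
full = speculative_step([5], draft_next, target_next, max_draft_len=4)
```

In the first call the draft proposes four tokens and two are accepted before the mismatch; in the second call all four are accepted, which is the case where speculation pays off most.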
Configuration
Set speculative_config in the llm_config block. The required fields depend on decoding_type.
Speculative strategy. One of Eagle, MTP, or NGram (case-insensitive).
Required when decoding_type is Eagle. Path to the Eagle head weights directory. BDN mirrors this as a standalone weight volume, separate from the main model weights.
Required when decoding_type is MTP. Number of next-token prediction layers in the model architecture.
Optional. Maximum number of tokens the draft proposes per step. Raise it for more aggressive speculation, lower it if acceptance is poor.
Optional, Eagle only. Run the Eagle3 draft head and the target model as a single fused model. Set to true for Eagle3 checkpoints that support it.
When to use
Pick by model architecture, not preference. Use MTP for DeepSeek-V3 and other models that ship MTP heads. Use Eagle when you have a trained Eagle head for the target model. Use NGram for high-throughput workloads where any acceleration helps and no draft model is available.
Monitoring
The BIS-LLM dashboard exposes speculation_rate when speculative decoding is active: the percentage of draft tokens accepted by the main model.
- Above 80%: Draft is well-aligned with the main model. Speculation is effective.
- 40-80%: Some rejections. Consider tuning the draft model or switching decoding types.
- Below 40%: Speculation likely costs more than it saves. Disable it or reduce draft length.
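The thresholds above translate into a simple check, for example in an alerting script. The function names and return strings below are illustrative, and treating exactly 80% as part of the middle band is an assumption, since the text does not say which side the boundary falls on.

```python
def speculation_rate(accepted_draft_tokens, proposed_draft_tokens):
    """Percentage of draft tokens the main model accepted."""
    if proposed_draft_tokens == 0:
        return 0.0
    return 100.0 * accepted_draft_tokens / proposed_draft_tokens


def interpret(rate_percent):
    """Map a speculation rate to the guidance bands listed above."""
    if rate_percent > 80:
        return "effective"
    if rate_percent >= 40:
        return "consider tuning"
    return "likely net cost"


rate = speculation_rate(accepted_draft_tokens=2, proposed_draft_tokens=4)
verdict = interpret(rate)
```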
Related
- BIS-LLM overview: Engine fundamentals and supported model families.
- BIS-LLM configuration: Truss config.yaml reference for the build step.
- Autoscaling BIS-LLM: Token-based autoscaling for prefill, decode, and aggregated replicas.
- Lookahead decoding (v1): N-gram speculation for Engine-Builder-LLM, when you need the v1 path.