Lookahead decoding is a form of speculative decoding that accelerates inference for predictable workloads by exploiting n-gram patterns in the generated text.
## Quick start
```yaml
trt_llm:
  build:
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
```
## Engine compatibility with lookahead decoding
| Feature | Engine-Builder-LLM | BIS-LLM |
|---|---|---|
| Lookahead decoding | ✅ Supported | ✅ Gated Feature |
| Structured outputs | ❌ Incompatible | ✅ Supported |
| Tool calling | ❌ Incompatible | ✅ Supported |
| Eagle speculation | ❌ Not supported | ✅ Gated Feature |
The incompatibilities above apply only when lookahead decoding is enabled. Engine-Builder-LLM supports structured outputs and tool calling in standard (non-speculative) deployments.
## Configuration examples
### Code generation (Qwen2.5-Coder)
```yaml
model_name: Qwen2.5-Coder-7B-Lookahead
resources:
  accelerator: H100
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-7B-Instruct"
    quantization_type: fp8_kv
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
```
### Large model (Llama-3.3-70B)
```yaml
model_name: Llama-3.3-70B-Lookahead
resources:
  accelerator: H100:2
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
    quantization_type: fp8_kv
    tensor_parallel_count: 2
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 5
      lookahead_verification_set_size: 3
```
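If you deploy several models with similar workloads, it can help to keep the speculator settings in one place. The snippet below is a hypothetical helper (not part of the engine API) that assembles the `speculator` section from presets using the values in the examples above; the preset names and function are illustrative only.

```python
# Hypothetical helper for assembling the `speculator` config section.
# Preset values are taken from the configuration examples in this doc.
LOOKAHEAD_PRESETS = {
    "code": {  # values from the Qwen2.5-Coder example above
        "lookahead_windows_size": 3,
        "lookahead_ngram_size": 8,
        "lookahead_verification_set_size": 3,
    },
    "general": {  # values from the Llama-3.3-70B example above
        "lookahead_windows_size": 3,
        "lookahead_ngram_size": 5,
        "lookahead_verification_set_size": 3,
    },
}

def speculator_config(preset: str) -> dict:
    """Return the speculator section of a config for the given preset."""
    cfg = {
        "enable_b10_lookahead": True,
        "speculative_decoding_mode": "LOOKAHEAD_DECODING",
    }
    cfg.update(LOOKAHEAD_PRESETS[preset])
    return cfg
```

The resulting dict can be serialized into the `trt_llm.build.speculator` block of a deployment config.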
## Parameter tuning
See the lookahead decoding documentation for detailed parameter explanations.

Quick guidelines:
- lookahead_windows_size: 1-7 (use 1 for highly predictable content, 3 or 5 otherwise)
- lookahead_ngram_size: 4-32 (larger for code, smaller for creative tasks)
- lookahead_verification_set_size: usually equal to lookahead_windows_size
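To build intuition for why larger n-grams help on repetitive content, here is a simplified, self-contained sketch of n-gram-based drafting. It is illustrative only: the function names are made up, and the real lookahead algorithm in the engine maintains its n-gram pool dynamically during generation and verifies every draft against the base model rather than trusting the pool.

```python
# Illustrative sketch only -- not the engine's actual implementation.
def build_ngram_pool(tokens, ngram_size):
    """Map each (ngram_size - 1)-token prefix to the tokens that followed it."""
    pool = {}
    for i in range(len(tokens) - ngram_size + 1):
        key = tuple(tokens[i : i + ngram_size - 1])
        pool.setdefault(key, []).append(tokens[i + ngram_size - 1])
    return pool

def draft_tokens(tokens, ngram_size, max_draft):
    """Greedily extend the sequence by replaying previously seen n-grams.

    In real lookahead decoding these drafts are cheap guesses that the
    base model then verifies in a single forward pass.
    """
    pool = build_ngram_pool(tokens, ngram_size)
    draft = list(tokens)
    out = []
    for _ in range(max_draft):
        key = tuple(draft[-(ngram_size - 1):])
        if key not in pool:
            break  # no known continuation -> fall back to normal decoding
        nxt = pool[key][-1]  # take the most recently seen continuation
        draft.append(nxt)
        out.append(nxt)
    return out

# A repeating token pattern (as in boilerplate code) drafts successfully:
print(draft_tokens([1, 2, 3, 4, 1, 2, 3], ngram_size=3, max_draft=4))
# -> [4, 1, 2, 3]
```

The more repetitive the output (boilerplate code, templates), the more often a prefix is found in the pool, which is why longer n-grams pay off for code while creative text rarely repeats long spans.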
## Use cases
| Use case | lookahead_windows_size | lookahead_ngram_size | Why |
|---|---|---|---|
| Code generation | 3 | 7 | Repetitive code patterns favor larger n-grams |
| Free-form JSON/YAML | 5 | 5 | Balanced for structured data |
| Template completion | 7-10 | 5-7 | Highly predictable content |
## Limitations
❌ Not compatible with lookahead decoding:
- Structured outputs - Use BIS-LLM for deployments that require both speculative decoding and structured outputs
- Function calling - Use BIS-LLM for deployments that require both speculative decoding and tool calling
- BIS-LLM engine - Lookahead decoding on the V2 stack is not self-serviceable; it is available only as a gated feature.