Lookahead decoding is a form of speculative decoding that accelerates inference for predictable workloads by exploiting n-gram patterns in the generated text.
## Quick start
```yaml
trt_llm:
  build:
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 1
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 1
```
## Engine compatibility with lookahead decoding
| Feature | Engine-Builder-LLM | BIS-LLM |
|---|---|---|
| Lookahead decoding | ✅ Supported | ✅ Gated Feature |
| Structured outputs | ❌ Incompatible | ✅ Supported |
| Tool calling | ❌ Incompatible | ✅ Supported |
| Eagle speculation | ❌ Not supported | ✅ Gated Feature |
The incompatibilities above apply only when lookahead decoding is enabled. Engine-Builder-LLM supports structured outputs and tool calling in standard (non-speculative) deployments.
## Configuration examples
### Code generation (Qwen2.5-Coder)
```yaml
model_name: Qwen2.5-Coder-7B-Lookahead
resources:
  accelerator: H100
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-Coder-7B-Instruct"
    quantization_type: fp8_kv
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 3
```
### Large model (Llama-3.3-70B)
```yaml
model_name: Llama-3.3-70B-Lookahead
resources:
  accelerator: H100:2
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "meta-llama/Llama-3.3-70B-Instruct"
    quantization_type: fp8_kv
    tensor_parallel_count: 2
    speculator:
      enable_b10_lookahead: true
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
      lookahead_ngram_size: 5
      lookahead_verification_set_size: 3
```
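If you deploy several models with similar workloads, it can help to keep the speculator settings in one place. The snippet below is a hypothetical helper (not part of the engine API) that assembles the `speculator` section from presets using the values in the examples above; the preset names and function are illustrative only.

```python
# Hypothetical helper for assembling the `speculator` config section.
# Preset values are taken from the configuration examples in this doc.
LOOKAHEAD_PRESETS = {
    "code": {  # values from the Qwen2.5-Coder example above
        "lookahead_windows_size": 3,
        "lookahead_ngram_size": 8,
        "lookahead_verification_set_size": 3,
    },
    "general": {  # values from the Llama-3.3-70B example above
        "lookahead_windows_size": 3,
        "lookahead_ngram_size": 5,
        "lookahead_verification_set_size": 3,
    },
}

def speculator_config(preset: str) -> dict:
    """Return the speculator section of a config for the given preset."""
    cfg = {
        "enable_b10_lookahead": True,
        "speculative_decoding_mode": "LOOKAHEAD_DECODING",
    }
    cfg.update(LOOKAHEAD_PRESETS[preset])
    return cfg
```

The resulting dict can be serialized into the `trt_llm.build.speculator` block of a deployment config.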
## Parameter tuning
See the lookahead decoding documentation for detailed parameter explanations.

Quick guidelines:
- lookahead_windows_size: 1-7 (use 1 for highly predictable content, 3 or 5 otherwise)
- lookahead_ngram_size: 4-32 (larger for code, smaller for creative tasks)
- lookahead_verification_set_size: usually equal to lookahead_windows_size
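To build intuition for why larger n-grams help on repetitive content, here is a simplified, self-contained sketch of n-gram-based drafting. It is illustrative only: the function names are made up, and the real lookahead algorithm in the engine maintains its n-gram pool dynamically during generation and verifies every draft against the base model rather than trusting the pool.

```python
# Illustrative sketch only -- not the engine's actual implementation.
def build_ngram_pool(tokens, ngram_size):
    """Map each (ngram_size - 1)-token prefix to the tokens that followed it."""
    pool = {}
    for i in range(len(tokens) - ngram_size + 1):
        key = tuple(tokens[i : i + ngram_size - 1])
        pool.setdefault(key, []).append(tokens[i + ngram_size - 1])
    return pool

def draft_tokens(tokens, ngram_size, max_draft):
    """Greedily extend the sequence by replaying previously seen n-grams.

    In real lookahead decoding these drafts are cheap guesses that the
    base model then verifies in a single forward pass.
    """
    pool = build_ngram_pool(tokens, ngram_size)
    draft = list(tokens)
    out = []
    for _ in range(max_draft):
        key = tuple(draft[-(ngram_size - 1):])
        if key not in pool:
            break  # no known continuation -> fall back to normal decoding
        nxt = pool[key][-1]  # take the most recently seen continuation
        draft.append(nxt)
        out.append(nxt)
    return out

# A repeating token pattern (as in boilerplate code) drafts successfully:
print(draft_tokens([1, 2, 3, 4, 1, 2, 3], ngram_size=3, max_draft=4))
# -> [4, 1, 2, 3]
```

The more repetitive the output (boilerplate code, templates), the more often a prefix is found in the pool, which is why longer n-grams pay off for code while creative text rarely repeats long spans.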
## Use cases
| Use case | lookahead_windows_size | lookahead_ngram_size | Why |
|---|---|---|---|
| Code generation | 3 | 7 | Repetitive code patterns favor larger n-grams |
| Free-form JSON/YAML | 5 | 5 | Balanced for structured data |
| Template completion | 7-10 | 5-7 | Highly predictable content |
## Limitations
❌ Not compatible with lookahead decoding:
- Structured outputs - Use BIS-LLM for deployments that require both speculative decoding and structured outputs
- Function calling - Use BIS-LLM for deployments that require both speculative decoding and tool calling
- BIS-LLM engine - Lookahead decoding on the V2 stack is not self-serviceable; it is available only as a gated feature.