trt_llm section in config.yaml.
Configuration structure
Build configuration
The build section configures model compilation and optimization settings.
The base model architecture for your model checkpoint. Options:
decoder: For CausalLM models (Llama, Mistral, Qwen, Gemma, Phi)
Specifies where to find the model checkpoint. The repository must be a valid Hugging Face model repository with the standard structure (config.json, tokenizer files, model weights). Source options:
HF: Hugging Face Hub (default)
GCS: Google Cloud Storage
S3: AWS S3
AZURE: Azure Blob Storage
REMOTE_URL: HTTP URL to a tar.gz file
BASETEN_TRAINING: Baseten Training checkpoints
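For illustration, a Hugging Face checkpoint source could be declared as follows. The checkpoint_repository key, its source and repo fields, and the repository ID are assumptions for this sketch, not copied from a verified schema.

```yaml
trt_llm:
  build:
    # Assumed key and field names; verify against your Truss/TRT-LLM schema.
    checkpoint_repository:
      source: HF                              # HF, GCS, S3, AZURE, REMOTE_URL, or BASETEN_TRAINING
      repo: meta-llama/Llama-3.1-8B-Instruct  # hypothetical Hugging Face repository ID
```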
Maximum sequence length (context) for single requests. Range: 1 to 1048576.
Maximum number of input sequences processed concurrently. Range: 1 to 2048. Unless lookahead decoding is enabled, this parameter has little effect on performance. Keep it at 256 for most cases.
Maximum number of batched input tokens after padding removal in each batch. Range: 256 to 131072, must be a multiple of 64. If enable_chunked_prefill: false, this also limits the max_seq_len that can be processed. Recommended: 8192 or 16384.
Specifies the quantization format for model weights. Options:
no_quant: FP16/BF16 precision
fp8: FP8 weights + 16-bit KV cache
fp8_kv: FP8 weights + FP8 KV cache
fp4: FP4 weights + 16-bit KV cache (B200 only)
fp4_kv: FP4 weights + FP8 KV cache (B200 only)
fp4_mlp_only: FP4 MLP only + 16-bit KV cache (B200 only)
Configuration for post-training quantization calibration. Fields:
calib_size: Size of the calibration dataset (64-16384, multiple of 64). Defines how many rows to take from the train split's text column.
calib_dataset: Hugging Face dataset for calibration. The dataset must have a "text" column (str type) for samples, or a "train" split as a subsection.
calib_max_seq_length: Maximum sequence length for calibration.
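As a sketch, the calibration fields might sit alongside the quantization type like this; the quantization_config parent key and the dataset name are assumptions.

```yaml
trt_llm:
  build:
    quantization_type: fp8_kv
    # "quantization_config" is an assumed parent key for the calibration fields above.
    quantization_config:
      calib_size: 1024                       # rows taken from the dataset's train split
      calib_dataset: my-org/calibration-set  # hypothetical dataset with a "text" column
      calib_max_seq_length: 2048
```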
Number of GPUs to use for tensor parallelism. Range: 1 to 8.
TensorRT-LLM plugin configuration for performance optimization. Fields:
paged_kv_cache: Enable paged KV cache (recommended: true)
use_paged_context_fmha: Enable paged context FMHA (recommended: true)
use_fp8_context_fmha: Enable FP8 context FMHA (requires fp8_kv quantization)
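The plugin flags above might be combined like this; the plugin_configuration parent key is an assumption.

```yaml
trt_llm:
  build:
    quantization_type: fp8_kv          # an FP8 KV cache is required for use_fp8_context_fmha
    # "plugin_configuration" is an assumed parent key for the plugin flags above.
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
```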
Configuration for speculative decoding with lookahead. For detailed configuration, see Lookahead decoding. Fields:
speculative_decoding_mode: LOOKAHEAD_DECODING (recommended)
lookahead_windows_size: Window size for speculation (1-8)
lookahead_ngram_size: N-gram size for patterns (1-16)
lookahead_verification_set_size: Verification buffer size (1-8)
enable_b10_lookahead: Enable Baseten's lookahead algorithm
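A minimal lookahead setup could look like the following; the speculative_decoding_config parent key and the specific values are illustrative assumptions within the documented ranges.

```yaml
trt_llm:
  build:
    # "speculative_decoding_config" is an assumed parent key for the fields above.
    speculative_decoding_config:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 4            # 1-8
      lookahead_ngram_size: 8              # 1-16
      lookahead_verification_set_size: 4   # 1-8
      enable_b10_lookahead: true
```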
Number of GPUs to use during the build job. Only set this if you encounter errors during the build job. It has no impact once the model reaches the deploying stage. If not set, it defaults to tensor_parallel_count.
Runtime configuration
The runtime section configures inference engine behavior.
Fraction of GPU memory to reserve for KV cache. Range: 0.1 to 1.0.
Enable chunked prefilling for long sequences.
Policy for scheduling requests in batches. Options:
max_utilization: Maximize GPU utilization (may evict requests)
guaranteed_no_evict: Guarantee request completion (recommended)
Model name returned in API responses.
Maximum number of tokens that can be scheduled at once. Range: 1 to 1000000.
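Put together, the runtime fields above might look like this; kv_cache_free_gpu_mem_fraction, served_model_name, and total_token_limit are assumed field names matched to the descriptions given.

```yaml
trt_llm:
  runtime:
    # Field names below are assumptions for the descriptions above; verify against
    # your Truss/TRT-LLM schema before relying on them.
    kv_cache_free_gpu_mem_fraction: 0.9
    enable_chunked_prefill: true
    batch_scheduler_policy: guaranteed_no_evict
    served_model_name: my-model
    total_token_limit: 131072
```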
Configuration examples
Llama 3.3 70B
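A possible config.yaml for this deployment is sketched below; the base_model, checkpoint_repository, and resources keys, the repository ID, and the GPU sizing are assumptions rather than a verified example.

```yaml
model_name: llama-3-3-70b-instruct
resources:
  accelerator: H100:2                        # assumed sizing; see the Best practices table
trt_llm:
  build:
    base_model: decoder                      # assumed field name
    checkpoint_repository:                   # assumed key; see "Source options" above
      source: HF
      repo: meta-llama/Llama-3.3-70B-Instruct
    quantization_type: fp8_kv
    tensor_parallel_count: 2                 # must match the accelerator count
    max_seq_len: 32768
    max_batch_size: 256
    max_num_tokens: 8192
    plugin_configuration:                    # assumed key
      paged_kv_cache: true
      use_paged_context_fmha: true
      use_fp8_context_fmha: true
  runtime:
    batch_scheduler_policy: guaranteed_no_evict
```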
Qwen 2.5 32B with lookahead decoding
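A sketch under the same assumptions (assumed keys: base_model, checkpoint_repository, resources, speculative_decoding_config; repository ID and values are illustrative):

```yaml
model_name: qwen-2-5-32b-lookahead
resources:
  accelerator: H100                          # assumed sizing
trt_llm:
  build:
    base_model: decoder                      # assumed field name
    checkpoint_repository:                   # assumed key
      source: HF
      repo: Qwen/Qwen2.5-32B-Instruct
    quantization_type: fp8_kv
    tensor_parallel_count: 1
    max_seq_len: 16384
    speculative_decoding_config:             # assumed key; see Lookahead decoding
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 4
      lookahead_ngram_size: 8
      lookahead_verification_set_size: 4
      enable_b10_lookahead: true
```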
Small model on L4
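A sketch for a small model on a single L4, with the same assumed keys and a hypothetical repository ID:

```yaml
model_name: llama-3-1-8b-l4
resources:
  accelerator: L4                            # assumed accelerator label
trt_llm:
  build:
    base_model: decoder                      # assumed field name
    checkpoint_repository:                   # assumed key
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct # hypothetical small model
    quantization_type: fp8_kv                # FP8 is supported on L4 (see Common errors)
    tensor_parallel_count: 1
    max_seq_len: 8192
    max_batch_size: 64
  runtime:
    batch_scheduler_policy: guaranteed_no_evict
```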
B200 with FP4 quantization
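A sketch using an FP4 variant, which the build options above restrict to B200; keys, repository ID, and sizing are again assumptions:

```yaml
model_name: llama-3-3-70b-fp4
resources:
  accelerator: B200                          # assumed accelerator label
trt_llm:
  build:
    base_model: decoder                      # assumed field name
    checkpoint_repository:                   # assumed key
      source: HF
      repo: meta-llama/Llama-3.3-70B-Instruct
    quantization_type: fp4_kv                # FP4 variants are B200-only
    tensor_parallel_count: 1
    max_seq_len: 32768
```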
Validation and troubleshooting
Common errors
Error: FP8 quantization is only supported on L4, H100, H200, B200
- Cause: Using FP8 quantization on an unsupported GPU.
- Fix: Use an H100 or newer GPU, or use no_quant.
FP4 quantization is only supported on B200
- Cause: Using FP4 quantization on an unsupported GPU.
- Fix: Use a B200 GPU or FP8 quantization.
Using fp8 context fmha requires fp8 kv, or fp4 with kv cache dtype
- Cause: Mismatch between quantization and context FMHA settings.
- Fix: Use fp8_kv quantization or disable use_fp8_context_fmha.
Tensor parallelism and GPU count must be the same
- Cause: Mismatch between tensor_parallel_count and GPU count.
- Fix: Ensure tensor_parallel_count matches the accelerator count.
Performance tuning
For lowest latency:
- Reduce max_batch_size and max_num_tokens.
- Use batch_scheduler_policy: guaranteed_no_evict.
- Consider smaller models or quantization.
For highest throughput:
- Increase max_batch_size and max_num_tokens.
- Use batch_scheduler_policy: max_utilization.
- Enable quantization on supported hardware.
For cost efficiency:
- Use L4 GPUs with FP8 quantization.
- Choose appropriately sized models.
- Tune max_seq_len to your actual requirements.
Model repository structure
All model sources (S3, GCS, Hugging Face, or tar.gz) must follow the standard Hugging Face repository structure, with all files in the root directory of the repository or archive.
Required files
Model configuration (config.json):
- max_position_embeddings: Limits maximum context size (content beyond this is truncated).
- vocab_size: Vocabulary size for the model.
- architectures: Must include LlamaForCausalLM, MistralForCausalLM, or a similar causal LM architecture. Custom code is typically not read.
- torch_dtype: Default inference dtype (float16 or bfloat16). Cannot be a pre-quantized model.
Model weights (model.safetensors):
- Or: model.safetensors.index.json + model-xx-of-yy.safetensors (sharded).
- Convert to safetensors if you encounter issues with other formats.
- Cannot be a pre-quantized model. The model must be an fp16, bf16, or fp32 checkpoint.
Tokenizer (tokenizer_config.json and tokenizer.json):
- For maximum compatibility, use "fast" tokenizers compatible with Rust.
- Cannot contain custom Python code.
- For chat completions: must contain chat_template, a Jinja2 template.
Architecture support
| Model family | Supported architectures | Notes |
|---|---|---|
| Llama | LlamaForCausalLM | Full support for Llama 3. For Llama 4, use BIS-LLM. |
| Mistral | MistralForCausalLM | Including v0.3 and Small variants. |
| Qwen | Qwen2ForCausalLM, Qwen3ForCausalLM | Including Qwen 2.5 and Qwen 3 series. |
| Gemma | GemmaForCausalLM | Including Gemma 2 and Gemma 3 series, bf16 only. |
Best practices
Model size and GPU selection
| Model size | Recommended GPU | Quantization | Tensor parallel |
|---|---|---|---|
| <8B | L4/H100 | FP8_KV | 1 |
| 8B-70B | H100 | FP8_KV | 1-2 |
| 70B+ | H100/B200 | FP8_KV/FP4 | 4+ |
Production recommendations
- Use quantization_type: fp8_kv for the best performance/accuracy balance.
- Set max_batch_size based on your expected traffic patterns.
- Enable paged_kv_cache and use_paged_context_fmha for optimal performance.
Development recommendations
- Use quantization_type: no_quant for fastest iteration.
- Set a smaller max_seq_len to reduce build time.
- Use batch_scheduler_policy: guaranteed_no_evict for predictable behavior.