Configure your TensorRT-LLM inference engine in config.yaml, such as in this Llama 3.1 8B example:
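The sketch below assumes the build and runtime options documented in this section; the repository name, accelerator, and tuning values are placeholders to adapt for your deployment.

```yaml
# Hedged sketch of a config.yaml for Llama 3.1 8B; adjust values for your deployment.
resources:
  accelerator: H100   # assumed accelerator with enough memory for an 8B model
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct   # assumed HF repo name
      revision: main
    max_batch_size: 256
    max_seq_len: 8192        # placeholder; set to the context length you need
    max_num_tokens: 8192
    quantization_type: no_quant
    tensor_parallel_count: 1
  runtime:
    kv_cache_free_gpu_mem_fraction: 0.9
```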
trt_llm.build
base_model
The model architecture family. Supported values:
- decoder: for CausalLM architectures such as Llama/Mistral/Qwen3ForCausalLM
- encoder: for Bert/Roberta/LlamaForSequenceClassification, sentence-transformer models, and embedding models
Deprecated values:
- llama (deprecated, use decoder)
- mistral (deprecated, use decoder)
- deepseek (deprecated, use decoder)
- qwen (deprecated, use decoder)
- whisper (deprecated, part of a separate product line)

checkpoint_repository
The repository containing the model checkpoint to build the engine from. If the checkpoint requires authentication, add the hf_access_token or trt_llm_gcs_service_account secrets with a valid access token or service account JSON for HuggingFace or GCS, respectively.
checkpoint_repository.source
Where the checkpoint is hosted. Supported values:
- HF (HuggingFace)
- GCS (Google Cloud Storage)
- REMOTE_URL: a tarball containing your checkpoint. Important: the archive must unpack with all required files (e.g., config.json) at the root level. For example, config.json should be directly in the tarball, not nested under a subdirectory like model_name/config.json.

checkpoint_repository.repo
The repository identifier: a HuggingFace repo name, GCS path, or remote URL, depending on the configured source.

checkpoint_repository.revision (default: "main")
The specific model version to use. It can be a branch name, a tag name, or a commit id. This field is only applicable to HF (HuggingFace) models.
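As hedged sketches (the bucket path and URL below are placeholders, not values from this guide), a GCS-hosted or tarball-hosted checkpoint might look like this:

```yaml
# GCS-hosted checkpoint (placeholder bucket path)
checkpoint_repository:
  source: GCS
  repo: gs://my-bucket/llama-3-1-8b-checkpoint
---
# Tarball at a remote URL; required files such as config.json must sit at the archive root
checkpoint_repository:
  source: REMOTE_URL
  repo: https://example.com/checkpoints/llama-3-1-8b.tar.gz
```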
max_batch_size (default: 256)
Maximum number of input sequences to pass through the engine concurrently. Batch size is directly related to throughput, but inversely related to single-request latency.
Tune this value according to your SLAs and latency budget.
max_seq_len
Maximum total sequence length (input plus generated tokens) supported for a single request.
max_num_tokens (default: 8192)
Maximum number of batched input tokens per batch after padding is removed. Tuning this value allows memory to be allocated to the KV cache more efficiently, so more requests can be executed together.
max_prompt_embedding_table_size (default: 0)
Maximum prompt embedding table size for prompt tuning.
num_builder_gpus (default: auto)
Number of GPUs to use at build time; defaults to the configured resources.accelerator count. This is particularly useful for FP8 quantization, which requires more GPU memory at build time than at inference time.
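For instance, a hypothetical FP8 build might request extra build-time GPUs like this (the counts are illustrative, not recommendations):

```yaml
trt_llm:
  build:
    quantization_type: fp8
    num_builder_gpus: 2   # illustrative; FP8 builds can need more GPU memory than inference
    tensor_parallel_count: 1
```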
plugin_configuration
plugin_configuration.paged_kv_cache (default: True)
Decompose the KV cache into paged blocks. See the TensorRT-LLM documentation for more about what this does.
plugin_configuration.use_paged_context_fmha (default: True)
Use paged context for fused multi-head attention. This setting is required to enable KV cache reuse. See the TensorRT-LLM documentation for more details.
plugin_configuration.use_fp8_context_fmha (default: False)
Use FP8 quantization for context fused multi-head attention to accelerate attention. To use this option, also set plugin_configuration.use_paged_context_fmha: True. See the TensorRT-LLM documentation for guidance on when to enable it.
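A sketch of how these plugin flags combine for an FP8 engine where both paged context FMHA and FP8 context FMHA are enabled:

```yaml
trt_llm:
  build:
    quantization_type: fp8
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true   # prerequisite for use_fp8_context_fmha and KV cache reuse
      use_fp8_context_fmha: true
```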
quantization_type (default: no_quant)
Quantization format with which to build the engine. Supported formats include:
- no_quant (meaning bf16)
- fp8
- fp8_kv
- smooth_quant
- weights_int8
- weights_kv_int8
- weights_int4
- weights_int4_kv_int8
strongly_typed (default: False)
Whether to build the engine using strong typing, which lets TensorRT's optimizer statically infer intermediate tensor types and can speed up build time for some formats.
Weak typing lets the optimizer choose tensor types, which may result in a faster runtime. For more information, refer to the TensorRT documentation.
tensor_parallel_count (default: 1)
Tensor parallelism count. For more information, refer to the NVIDIA documentation.
speculator (default: None)
Configuration for optional speculative decoding.

speculator.build
For example, here is a sample configuration for utilizing speculative decoding for Llama-3-70B-Instruct:
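A minimal sketch under the assumption that a smaller Llama 3 model serves as the draft model in DRAFT_TOKENS_EXTERNAL mode; both repository names and the draft token count are placeholders:

```yaml
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: meta-llama/Meta-Llama-3-70B-Instruct    # target model (placeholder repo name)
    speculator:
      speculative_decoding_mode: DRAFT_TOKENS_EXTERNAL
      num_draft_tokens: 4                           # illustrative value
      checkpoint_repository:
        source: HF
        repo: meta-llama/Meta-Llama-3-8B-Instruct   # assumed draft model (placeholder)
```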
speculator.speculative_decoding_mode
The speculative decoding algorithm to use (for example, DRAFT_TOKENS_EXTERNAL).

speculator.num_draft_tokens
Number of draft tokens to use per decoding iteration.

speculator.checkpoint_repository
The checkpoint repository for the draft model; see checkpoint_repository for details.

speculator.lookahead_ngram_size, speculator.lookahead_windows_size, speculator.lookahead_verification_set_size
- windows_size is the Jacobi window size, meaning the number of n-grams in the lookahead branch that explore future draft tokens.
- ngram_size is the n-gram size, meaning the maximum number of draft tokens accepted per iteration.
- verification_set_size is the maximum number of n-grams considered for verification, meaning the number of draft token beam hypotheses.
lookahead_ngram_size is often increased when the generated tokens are similar to the contents of the prompt, and decreased if they are dissimilar.
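As a sketch of how the lookahead parameters fit together, assuming a lookahead mode named LOOKAHEAD_DECODING (the mode name and values here are assumptions for illustration):

```yaml
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING   # assumed mode name
      lookahead_windows_size: 4           # Jacobi window size: n-grams exploring future draft tokens
      lookahead_ngram_size: 3             # max draft tokens accepted per iteration
      lookahead_verification_set_size: 4  # n-gram hypotheses considered for verification
```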
lora_adapters (default: None)
A mapping from LoRA names to checkpoint repositories. For example:
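A minimal sketch of such a mapping, assuming a single adapter named lora1 hosted on HuggingFace (the repository is a placeholder):

```yaml
trt_llm:
  build:
    lora_adapters:
      lora1:
        source: HF
        repo: my-org/llama-3-1-8b-lora   # placeholder adapter repository
```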
See checkpoint_repository for details on how to configure checkpoint repositories.
In addition to specifying the LoRAs here, you need to specify the served_model_name that is used to refer to the base model. The served_model_name is required for deploying LoRAs.
The LoRA name (in the example above, this is "lora1") is used to query the model using the specified LoRA.
trt_llm.runtime
kv_cache_free_gpu_mem_fraction (default: 0.9)
Controls the fraction of free GPU memory allocated for the KV cache. For more information, refer to the TensorRT-LLM documentation.
If you are using DRAFT_TOKENS_EXTERNAL, we recommend lowering this value, depending on the size of the draft model.
enable_chunked_context (default: False)
Enables chunked context, increasing the chance of batch processing between the context and generation phases, which may be useful for increasing throughput.
Note that you must set plugin_configuration.use_paged_context_fmha: True in order to leverage this feature.
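A sketch showing how the build-time and runtime settings combine for chunked context; the scheduler choice is illustrative:

```yaml
trt_llm:
  build:
    plugin_configuration:
      use_paged_context_fmha: true   # required for chunked context
  runtime:
    enable_chunked_context: true
    kv_cache_free_gpu_mem_fraction: 0.9
    batch_scheduler_policy: max_utilization   # illustrative; see the policies below
```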
batch_scheduler_policy (default: guaranteed_no_evict)
Supported scheduler policies:
- guaranteed_no_evict: ensures that an in-progress request is never evicted, by reserving KV cache space for the maximum possible number of tokens that can be returned for a request.
- max_utilization: packs as many requests as possible during scheduling, which may increase throughput at the expense of additional latency.
For more information, refer to the NVIDIA documentation.
request_default_max_tokens (default: None)
Default maximum number of tokens to generate for a single sequence if one is not provided in the request body.
Sensible settings depend on your use case; a general starting value is around 1000 tokens.
served_model_name (default: None)
The name used to refer to the base model when using LoRAs.
At least one LoRA must be specified under lora_adapters to use LoRAs.
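Putting the LoRA options together, a hedged sketch (the adapter repository and base-model name are placeholders; lora1 matches the example above):

```yaml
trt_llm:
  build:
    lora_adapters:
      lora1:
        source: HF
        repo: my-org/llama-3-1-8b-lora   # placeholder adapter repository
  runtime:
    served_model_name: llama-3-1-8b      # assumed name clients use for the base model
```

Per the descriptions above, requests that should apply the adapter reference the LoRA name (lora1), while requests to the base model use the served_model_name.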