This reference lists every configuration option for the TensorRT-LLM Engine Builder. These options are set in config.yaml, as in this Llama 3.1 8B example:

config.yaml
model_name: Llama 3.1 8B Engine
resources:
  accelerator: H100:1
secrets:
  hf_access_token: "set token in baseten workspace"
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.1-8B-Instruct
      source: HF
    max_seq_len: 8000

trt_llm.build

TRT-LLM engine build configuration. TensorRT-LLM attempts to build a highly optimized network based on input shapes representative of your workload.

base_model

The base model architecture of your model checkpoint. Supported architectures include:

  • llama
  • mistral
  • deepseek
  • qwen

checkpoint_repository

Specification of the model checkpoint used for engine building. For example:

checkpoint_repository:
    source: HF | GCS | REMOTE_URL
    repo: meta-llama/Llama-3.1-8B-Instruct | gs://bucket_name | https://your-checkpoint.com

To access private model checkpoints, register secrets in your Baseten workspace: hf_access_token with a valid Hugging Face access token, or trt_llm_gcs_service_account with a valid GCS service account JSON, respectively. Ensure that you push your Truss with the --trusted flag to enable access to your secrets.
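
For instance, a private Hugging Face checkpoint might be configured as follows; the repo name below is a hypothetical placeholder, and the secret value itself is never written in config.yaml:

secrets:
  hf_access_token: "set token in baseten workspace"
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: your-org/your-private-model  # hypothetical private repo
      source: HF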

checkpoint_repository.source

Source where the checkpoint is stored. Supported sources include:

  • HF (HuggingFace)
  • GCS (Google Cloud Storage)
  • REMOTE_URL

checkpoint_repository.repo

Checkpoint repository name, bucket, or URL.

max_batch_size

(default: 256)

Maximum number of input sequences to pass through the engine concurrently. Increasing batch size generally increases throughput, but at the cost of higher single-request latency. Tune this value according to your SLAs and latency budget.
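
For example, a latency-sensitive deployment might cap the batch size well below the default; the value below is illustrative rather than a recommendation:

trt_llm:
  build:
    max_batch_size: 32  # illustrative; tune for your latency budget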

max_beam_width

(default: 1)

Maximum number of candidate sequences with which to conduct beam search. This value should act as a tight upper bound on the number of beam candidates.

Currently, only a beam width of 1 is supported.

max_seq_len

Defines the maximum sequence length (context) of a single request.

max_num_tokens

(default: 8192)

Defines the maximum number of batched input tokens per batch after padding is removed. Tuning this value allows memory to be allocated to the KV cache more efficiently and more requests to be executed together.
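
As a sketch, a workload with long prompts might raise this limit alongside max_seq_len; both values below are assumptions for illustration only:

trt_llm:
  build:
    max_seq_len: 16384     # illustrative context length
    max_num_tokens: 16384  # illustrative batched-token budget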

max_prompt_embedding_table_size

(default: 0)

Maximum prompt embedding table size for prompt tuning.

num_builder_gpus

(default: auto)

Number of GPUs to use at build time. Defaults to the configured resources.accelerator count. This is particularly useful for FP8 quantization, where more GPU memory is required at build time than at inference.
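
For example, an FP8 build that needs extra memory during quantization might request more build-time GPUs while still serving on a single accelerator; the counts below are illustrative:

resources:
  accelerator: H100:1
trt_llm:
  build:
    quantization_type: fp8
    num_builder_gpus: 4  # illustrative; extra GPUs are used only during the build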

plugin_configuration

Config for inserting plugin nodes into the network graph definition so that user-defined kernels can be executed.

plugin_configuration.paged_kv_cache

(default: True)

Decompose KV cache into page blocks. Read more about what this does here.

plugin_configuration.gemm_plugin

(default: auto)

Utilize NVIDIA cuBLASLt for GEMM ops. Read more about when to enable this here.

plugin_configuration.use_paged_context_fmha

(default: False)

Utilize paged context for fused multihead attention. This configuration is necessary to enable KV cache reuse. Read more about this configuration here.

plugin_configuration.use_fp8_context_fmha

(default: False)

Utilize FP8 quantization for context fused multihead attention to accelerate the attention computation. To use this configuration, plugin_configuration.use_paged_context_fmha must also be set to True. Read more about when to enable this here.
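
Putting these plugin options together, a build that enables KV cache reuse and FP8 attention might look like the following sketch; whether FP8 attention helps depends on your hardware and quantization format:

trt_llm:
  build:
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true  # required for KV cache reuse and for FP8 context FMHA
      use_fp8_context_fmha: true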

quantization_type

(default: no_quant)

Quantization format with which to build the engine. Supported formats include:

  • no_quant (meaning fp16)
  • weights_int8
  • weights_kv_int8
  • weights_int4
  • weights_int4_kv_int8
  • smooth_quant
  • fp8
  • fp8_kv

Read more about the different post-training quantization techniques supported by TRT-LLM here. Additionally, refer to the hardware and quantization technique support matrix.

strongly_typed

(default: False)

Whether to build the engine using strong typing, which enables TensorRT’s optimizer to statically infer intermediate tensor types and can speed up build time for some formats. Weak typing lets the optimizer choose tensor types, which may result in a faster runtime. For more information, refer to the TensorRT documentation here.

tensor_parallel_count

(default: 1)

Tensor parallelism count. For more information, refer to the NVIDIA documentation here.

speculator

(default: None)

Speculative draft model configuration to be used for speculative decoding. By default, the speculator build will attempt to reuse as much of the target model build configuration as possible. To fully specify your own speculator build, define speculator.build.

For example, here is a sample configuration for utilizing speculative decoding for Qwen2.5-Coder-14B:

model_metadata:
  tags:
  - openai-compatible
model_name: Qwen2.5-Coder-14B-Instruct (SpecDec)
resources:
  accelerator: H100
  cpu: '1'
  memory: 24Gi
  use_gpu: true
trt_llm:
  build:
    base_model: qwen 
    checkpoint_repository:
      repo: Qwen/Qwen2.5-Coder-14B-Instruct
      source: HF
    max_seq_len: 10000
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
    speculator:
      speculative_decoding_mode: DRAFT_TOKENS_EXTERNAL
      checkpoint_repository:
        repo: Qwen/Qwen2.5-Coder-0.5B-Instruct
        source: HF
      num_draft_tokens: 4
  runtime:
    enable_chunked_context: true
    kv_cache_free_gpu_mem_fraction: 0.62
    request_default_max_tokens: 1000
    total_token_limit: 500000

speculator.speculative_decoding_mode

The type of speculative decoding tactic. Currently, support is limited to DRAFT_TOKENS_EXTERNAL.

speculator.num_draft_tokens

Number of draft tokens to sample from the speculative model. The best value depends on how many draft tokens are expected to be accepted; a good starting range is 2 to 8.

speculator.checkpoint_repository

See checkpoint_repository for details.

speculator.build

(default: None)

See build for details.

speculator.runtime

(default: None)

See trt_llm.runtime for details.

trt_llm.runtime

Runtime configuration for the built engine.

kv_cache_free_gpu_mem_fraction

(default: 0.9)

Controls the fraction of free GPU memory allocated to the KV cache. For more information, refer to the documentation here.

enable_chunked_context

(default: False)

Enables chunked context, which increases the chance of batching the context and generation phases together and may increase throughput. Note that plugin_configuration.use_paged_context_fmha: True must be set in the build configuration to use this feature.
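
For example, enabling chunked context together with its required build-time plugin might look like this sketch:

trt_llm:
  build:
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true  # required for chunked context
  runtime:
    enable_chunked_context: true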

batch_scheduler_policy

(default: guaranteed_no_evict)

Supported scheduler policies are as follows:

  • guaranteed_no_evict
  • max_utilization

guaranteed_no_evict ensures that an in-progress request is never evicted by reserving KV cache space for the maximum possible number of tokens that a request can return. max_utilization packs as many requests as possible during scheduling, which may increase throughput at the expense of additional latency. For more information, refer to the NVIDIA documentation here.
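
A throughput-oriented runtime might switch the scheduler policy and adjust the KV cache memory fraction, for example; the values below are illustrative, not tuned recommendations:

trt_llm:
  runtime:
    batch_scheduler_policy: max_utilization
    kv_cache_free_gpu_mem_fraction: 0.85  # illustrative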

request_default_max_tokens

(default: None)

Default server configuration for the maximum number of tokens to generate for a single sequence if one is not provided in the request body. Sensible settings depend on your use case; around 1000 tokens is a reasonable starting point.