This reference lists every configuration option for the TensorRT-LLM Engine Builder. These options are set in config.yaml, as in this Llama 3.1 8B example:

config.yaml
model_name: Llama 3.1 8B Engine
resources:
  accelerator: H100:1
secrets:
  hf_access_token: "set token in baseten workspace"
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.1-8B-Instruct
      source: HF
    max_seq_len: 8000

trt_llm.build

TRT-LLM engine build configuration. TensorRT-LLM attempts to build a highly optimized network based on input shapes representative of your workload.

base_model

The base model architecture of your model checkpoint. Supported architectures include:

  • llama
  • mistral
  • deepseek
  • qwen

checkpoint_repository

Specification of the model checkpoint used for engine building. For example:

checkpoint_repository:
    source: HF | GCS | REMOTE_URL
    repo: meta-llama/Llama-3.1-8B-Instruct | gs://bucket_name | https://your-checkpoint.com

To access private model checkpoints, register secrets in your Baseten workspace: hf_access_token with a valid Hugging Face access token, or trt_llm_gcs_service_account with a valid GCS service account JSON, respectively. Ensure that you push your Truss with the --trusted flag to enable access to your secrets.
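
For instance, a private Hugging Face checkpoint might be configured as follows; the repo name below is a hypothetical placeholder, and the secret value itself is never written in config.yaml:

secrets:
  hf_access_token: "set token in baseten workspace"
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: your-org/your-private-model  # hypothetical private repo
      source: HF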

checkpoint_repository.source

Source where the checkpoint is stored. Supported sources include:

  • HF (HuggingFace)
  • GCS (Google Cloud Storage)
  • REMOTE_URL

checkpoint_repository.repo

Checkpoint repository name, bucket, or URL.

max_batch_size

(default: 256)

Maximum number of input sequences to pass through the engine concurrently. Increasing batch size generally increases throughput, but at the cost of higher single-request latency. Tune this value according to your SLAs and latency budget.
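
For example, a latency-sensitive deployment might cap the batch size well below the default; the value below is illustrative rather than a recommendation:

trt_llm:
  build:
    max_batch_size: 32  # illustrative; tune for your latency budget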

max_beam_width

(default: 1)

Maximum number of candidate sequences with which to conduct beam search. This value should act as a tight upper bound on the number of beam candidates.

Currently, only a beam width of 1 is supported.

max_seq_len

Defines the maximum sequence length (context) of a single request.

max_num_tokens

(default: 8192)

Defines the maximum number of batched input tokens per batch after padding is removed. Tuning this value allows memory to be allocated to the KV cache more efficiently and more requests to be executed together.
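
As a sketch, a workload with long prompts might raise this limit alongside max_seq_len; both values below are assumptions for illustration only:

trt_llm:
  build:
    max_seq_len: 16384     # illustrative context length
    max_num_tokens: 16384  # illustrative batched-token budget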

max_prompt_embedding_table_size

(default: 0)

Maximum prompt embedding table size for prompt tuning.

num_builder_gpus

(default: auto)

Number of GPUs to use at build time. Defaults to the configured resources.accelerator count. This is particularly useful for FP8 quantization, where more GPU memory is required at build time than at inference.
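
For example, an FP8 build that needs extra memory during quantization might request more build-time GPUs while still serving on a single accelerator; the counts below are illustrative:

resources:
  accelerator: H100:1
trt_llm:
  build:
    quantization_type: fp8
    num_builder_gpus: 4  # illustrative; extra GPUs are used only during the build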

plugin_configuration

Config for inserting plugin nodes into the network graph definition so that user-defined kernels can be executed.

plugin_configuration.paged_kv_cache

(default: True)

Decompose KV cache into page blocks. Read more about what this does here.

plugin_configuration.gemm_plugin

(default: auto)

Utilize NVIDIA cuBLASLt for GEMM ops. Read more about when to enable this here.

plugin_configuration.use_paged_context_fmha

(default: False)

Utilize paged context for fused multihead attention. This configuration is necessary to enable KV cache reuse. Read more about this configuration here.

plugin_configuration.use_fp8_context_fmha

(default: False)

Utilize FP8 quantization for context fused multihead attention to accelerate the attention computation. To use this configuration, plugin_configuration.use_paged_context_fmha must also be set to True. Read more about when to enable this here.
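
Putting these plugin options together, a build that enables KV cache reuse and FP8 attention might look like the following sketch; whether FP8 attention helps depends on your hardware and quantization format:

trt_llm:
  build:
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true  # required for KV cache reuse and for FP8 context FMHA
      use_fp8_context_fmha: true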

quantization_type

(default: no_quant)

Quantization format with which to build the engine. Supported formats include:

  • no_quant (meaning fp16)
  • weights_int8
  • weights_kv_int8
  • weights_int4
  • weights_int4_kv_int8
  • smooth_quant
  • fp8
  • fp8_kv

Read more about the different post-training quantization techniques supported by TRT-LLM here. Additionally, refer to the hardware and quantization technique support matrix.

strongly_typed

(default: False)

Whether to build the engine using strong typing, which enables TensorRT’s optimizer to statically infer intermediate tensor types and can speed up build time for some formats. Weak typing lets the optimizer choose tensor types, which may result in a faster runtime. For more information, refer to the TensorRT documentation here.

tensor_parallel_count

(default: 1)

Tensor parallelism count. For more information, refer to the NVIDIA documentation here.

speculator

(default: None)

Speculative draft model configuration to be used for speculative decoding. By default, the speculator build will attempt to reuse as much of the target model build configuration as possible. To fully specify your own speculator build, define speculator.build.

For example, here is a sample configuration for utilizing speculative decoding for Qwen2.5-Coder-14B:

model_metadata:
  tags:
  - openai-compatible
model_name: Qwen2.5-Coder-14B-Instruct (SpecDec)
resources:
  accelerator: H100
  cpu: '1'
  memory: 24Gi
  use_gpu: true
trt_llm:
  build:
    base_model: qwen 
    checkpoint_repository:
      repo: Qwen/Qwen2.5-Coder-14B-Instruct
      source: HF
    max_seq_len: 10000
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true
    speculator:
      speculative_decoding_mode: DRAFT_TOKENS_EXTERNAL
      checkpoint_repository:
        repo: Qwen/Qwen2.5-Coder-0.5B-Instruct
        source: HF
      num_draft_tokens: 4
  runtime:
    enable_chunked_context: true
    kv_cache_free_gpu_mem_fraction: 0.62
    request_default_max_tokens: 1000
    total_token_limit: 500000

speculator.speculative_decoding_mode

The type of speculative decoding tactic. Currently, support is limited to DRAFT_TOKENS_EXTERNAL.

speculator.num_draft_tokens

Number of draft tokens to sample from the speculative model. The best value depends on how many draft tokens are expected to be accepted; a good starting range is 2 to 8.

speculator.checkpoint_repository

See checkpoint_repository for details.

speculator.build

(default: None)

See build for details.

speculator.runtime

(default: None)

See trt_llm.runtime for details.

trt_llm.runtime

Runtime configuration for the built engine.

kv_cache_free_gpu_mem_fraction

(default: 0.9)

Controls the fraction of free GPU memory allocated to the KV cache. For more information, refer to the documentation here.

enable_chunked_context

(default: False)

Enables chunked context, which increases the chance of batching the context and generation phases together and may increase throughput. Note that plugin_configuration.use_paged_context_fmha: True must be set in the build configuration to use this feature.
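
For example, enabling chunked context together with its required build-time plugin might look like this sketch:

trt_llm:
  build:
    plugin_configuration:
      paged_kv_cache: true
      use_paged_context_fmha: true  # required for chunked context
  runtime:
    enable_chunked_context: true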

batch_scheduler_policy

(default: guaranteed_no_evict)

Supported scheduler policies are as follows:

  • guaranteed_no_evict
  • max_utilization

guaranteed_no_evict ensures that an in-progress request is never evicted by reserving KV cache space for the maximum possible number of tokens that a request can return. max_utilization packs as many requests as possible during scheduling, which may increase throughput at the expense of additional latency. For more information, refer to the NVIDIA documentation here.
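
A throughput-oriented runtime might switch the scheduler policy and adjust the KV cache memory fraction, for example; the values below are illustrative, not tuned recommendations:

trt_llm:
  runtime:
    batch_scheduler_policy: max_utilization
    kv_cache_free_gpu_mem_fraction: 0.85  # illustrative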

request_default_max_tokens

(default: None)

Default server configuration for the maximum number of tokens to generate for a single sequence if one is not provided in the request body. Sensible settings depend on your use case; around 1000 tokens is a reasonable starting point.