trt_llm

Configuration for TRT-LLM accelerated model services. Here’s a TinyLlama example to get started quickly:

model_name: TinyLlama-1.1B-Chat TRT-LLM
resources:
  accelerator: A10G
  use_gpu: True
trt_llm:
  build:
    max_input_len: 1000
    max_batch_size: 1
    max_beam_width: 1
    max_output_len: 500
    strongly_typed: True
    base_model: llama
    checkpoint_repository:
      repo: TinyLlama/TinyLlama-1.1B-Chat-v1.0
      source: HF

build

TRT-LLM engine build configuration. TensorRT-LLM attempts to build a highly optimized network based on input shapes representative of your workload.

base_model

The base model architecture of your model checkpoint. Supported architectures include:

  • llama
  • mistral
  • whisper
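
For example, to build from a Mistral checkpoint instead (a minimal sketch; the repository name below is illustrative, substitute your own checkpoint):

trt_llm:
  build:
    base_model: mistral
    checkpoint_repository:
      repo: mistralai/Mistral-7B-Instruct-v0.2
      source: HF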

max_input_len

Maximum input length in tokens to which prompts are limited. This value should act as a minimal upper bound for input length, i.e. the smallest value that still covers your expected prompts.

max_output_len

Maximum output length in tokens to which outputs are limited. This value should act as a minimal upper bound for output length, i.e. the smallest value that still covers your expected generations.

max_batch_size

Maximum number of input sequences to pass through the engine concurrently. Throughput scales with batch size, but larger batches also tend to increase single-request latency. Tune this value according to your SLAs and latency budget.

max_beam_width

Maximum number of candidate sequences with which to conduct beam search. This value should act as a minimal upper bound for the number of beam candidates.

max_prompt_embedding_table_size

(default: 0)

Maximum prompt embedding table size for prompt tuning.

checkpoint_repository

Specification of the model checkpoint used for engine building. For example:

checkpoint_repository:
    source: HF | GCS | REMOTE_URL
    repo: meta-llama/Meta-Llama-3-8B-Instruct | gs://bucket_name | https://your-checkpoint.com

To configure access to private model checkpoints, register secrets in your Baseten workspace: the hf_access_token secret with a valid HuggingFace access token, or the trt_llm_gcs_service_account secret with a valid GCS service account JSON. Ensure that you push your Truss with the --trusted flag to enable access to your secrets.
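
For example, a private HuggingFace checkpoint might be configured as follows. This is a minimal sketch that assumes the standard Truss secrets block in config.yaml; the placeholder value is never read, and the real token comes from your workspace secrets:

secrets:
  hf_access_token: null
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Meta-Llama-3-8B-Instruct
      source: HF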

checkpoint_repository.source

Source where the checkpoint is stored. Supported sources include:

  • HF (HuggingFace)
  • GCS (Google Cloud Storage)
  • REMOTE_URL

checkpoint_repository.repo

Checkpoint repository name, bucket, or URL.

strongly_typed

(default: False)

Whether to build the engine using strong typing, which enables TensorRT’s optimizer to statically infer intermediate tensor types and can speed up build time for some formats. Weak typing lets the optimizer choose tensor types, which may result in a faster runtime. For more information, refer to the TensorRT documentation here.

quantization_type

(default: no_quant)

Quantization format with which to build the engine. Supported formats include:

  • no_quant
  • weights_int8
  • fp8
  • fp8_kv

Read more about the different post-training quantization techniques supported by TRT-LLM here. Additionally, refer to the hardware and quantization technique support matrix.
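
As a sketch, an FP8 engine build could be configured as follows; verify that your target GPU appears in the support matrix before enabling FP8:

trt_llm:
  build:
    base_model: llama
    quantization_type: fp8
    checkpoint_repository:
      repo: meta-llama/Meta-Llama-3-8B-Instruct
      source: HF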

tensor_parallel_count

(default: 1)

Tensor parallelism count. For more information refer to NVIDIA documentation here.
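
As a sketch, sharding a model across two GPUs pairs the parallelism count with a matching accelerator count (the A100:2 resource syntax below is an assumption; use whatever multi-GPU resource your workspace supports):

resources:
  accelerator: A100:2
  use_gpu: True
trt_llm:
  build:
    tensor_parallel_count: 2
    # remaining build fields as in the example at the top of this page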

plugin_configuration

Configuration for inserting plugin nodes into the network graph definition to execute user-defined kernels.

plugin_configuration.multi_block_mode

(default: False)

Distribute masked MHA kernel work across multiple CUDA thread blocks in scenarios with low GPU occupancy. Read more about when to enable this plugin here.

plugin_configuration.paged_kv_cache

(default: True)

Decompose KV cache into page blocks. Read more about what this does here.

plugin_configuration.gemm_plugin

(default: float16)

Utilize NVIDIA cuBLASLt for GEMM ops. Read more about when to enable this here.
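
Putting these together, a build that enables multi-block mode while keeping the other plugin defaults could look like this (a sketch):

trt_llm:
  build:
    plugin_configuration:
      multi_block_mode: True
      paged_kv_cache: True
      gemm_plugin: float16
    # remaining build fields as in the example at the top of this page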

use_fused_mlp

(default: False)

Enables horizontal GEMM fusion in the gated MLP layer, potentially improving performance.

kv_cache_free_gpu_mem_fraction

(default: 0.9)

Controls the fraction of free GPU memory allocated for the KV cache. For more information, refer to the documentation here.

num_builder_gpus

(default: auto)

Number of GPUs used at build time. Defaults to the count configured in resources.accelerator. Increasing this is particularly useful for FP8 quantization, where more GPU memory is required at build time than at inference.
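
For example, to quantize to FP8 on four GPUs at build time while serving on a single GPU (a sketch; the accelerator choice and GPU counts are illustrative):

resources:
  accelerator: H100
  use_gpu: True
trt_llm:
  build:
    base_model: llama
    quantization_type: fp8
    num_builder_gpus: 4
    checkpoint_repository:
      repo: meta-llama/Meta-Llama-3-8B-Instruct
      source: HF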