trt_llm section in config.yaml.
Configuration structure
Build configuration
The `build` section configures model compilation and optimization settings.
The base model architecture determines which BEI variant to use. Options:
- `encoder`: BEI, for causal embedding models (Llama, Mistral, Qwen, Gemma)
- `encoder_bert`: BEI-Bert, for BERT-based models (BERT, RoBERTa, Jina, Nomic)
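As a minimal sketch (assuming the `trt_llm.build` schema described in this section; the commented alternative is illustrative), selecting the variant looks like this:

```yaml
trt_llm:
  build:
    base_model: encoder        # BEI: causal embedding models (Llama, Mistral, Qwen, Gemma)
    # base_model: encoder_bert # BEI-Bert: BERT-based models (BERT, RoBERTa, Jina, Nomic)
```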
Specifies where to find the model checkpoint. The repository must follow the standard HuggingFace structure. Options:
- `HF`: Hugging Face Hub (default)
- `GCS`: Google Cloud Storage
- `S3`: AWS S3
- `AZURE`: Azure Blob Storage
- `REMOTE_URL`: HTTP URL to a tar.gz file
- `BASETEN_TRAINING`: Baseten Training checkpoints
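For example, pulling a checkpoint from the Hugging Face Hub could look like the sketch below (the repository id is illustrative; verify key names against the `trt_llm` build schema):

```yaml
trt_llm:
  build:
    checkpoint_repository:
      source: HF                     # or GCS, S3, AZURE, REMOTE_URL, BASETEN_TRAINING
      repo: BAAI/bge-base-en-v1.5    # illustrative repository id
```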
Maximum number of tokens that can be processed in a single batch. BEI and BEI-Bert run without chunked prefill for performance reasons, which limits the effective context length to the `max_position_embeddings` value.

Range: 64 to 131072; must be a multiple of 64. Use higher values (up to 131072) for long-context models. Most models use 16384 as the default.

Not supported for BEI engines; leave this value unset. BEI sets it automatically and truncates input if the context length is exceeded.
Specifies the quantization format for model weights.
FP8 quantization maintains accuracy within 1% of FP16 for embedding models.

Options for BEI:
- `no_quant`: FP16/BF16 precision
- `fp8`: FP8 weights + 16-bit KV cache
- `fp4`: FP4 weights + 16-bit KV cache (B200 only)
- `fp4_mlp_only`: FP4 MLP weights only (B200 only)

Options for BEI-Bert:
- `no_quant`: FP16 precision (only option)
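A hedged sketch of the two cases, assuming the quantization field is named `quantization_type` (check this against your schema):

```yaml
trt_llm:
  build:
    base_model: encoder
    quantization_type: fp8   # BEI options: no_quant, fp8, fp4, fp4_mlp_only
    # For base_model: encoder_bert (BEI-Bert), only no_quant is available.
```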
Configuration for post-training quantization calibration. Fields:
- `calib_size`: Size of the calibration dataset (64-16384, multiple of 64)
- `calib_dataset`: HuggingFace dataset used for calibration
- `calib_max_seq_length`: Maximum sequence length for calibration
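A sketch of a calibration block, assuming these fields sit under a `calibration_config` key in the build section (the parent key name and dataset id are assumptions, not confirmed by this document):

```yaml
trt_llm:
  build:
    quantization_type: fp8
    calibration_config:             # parent key name is an assumption
      calib_size: 1024              # 64-16384, multiple of 64
      calib_dataset: cnn_dailymail  # illustrative HuggingFace dataset
      calib_max_seq_length: 2048
```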
BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported. Automatic optimizations:
- XQA kernels for maximum throughput
- Dynamic batching for optimal utilization
- Memory-efficient attention mechanisms
- Hardware-specific optimizations
Runtime configuration
The `runtime` section configures serving behavior.
The default API endpoint for the deployment. Options:
- `/v1/embeddings`: OpenAI-compatible embeddings endpoint
- `/rerank`: Reranking endpoint
- `/predict`: Classification/prediction endpoint
Embedding models default to `/v1/embeddings`; classification models default to `/predict`.

The remaining runtime options are not applicable to BEI engines and are only used for generative models.
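A minimal runtime sketch; the route field name below is an assumption and should be checked against the actual runtime schema:

```yaml
trt_llm:
  runtime:
    webserver_default_route: /v1/embeddings   # assumed field name; or /rerank, /predict
```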
HuggingFace Model Repository Structure
All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, in the same layout a standard Hugging Face download produces.
Model configuration
`config.json`
- `max_position_embeddings`: Limits the maximum context size (content beyond this is truncated)
- `id2label`: Required dictionary mapping IDs to labels for classification models
  - Note: Its length must match the shape of the last dense layer; each dense output needs a name for the JSON response.
- `architecture`: Must be `ModelForSequenceClassification` or similar (cannot be `ForCausalLM`)
  - Note: Remote code execution is not supported; the architecture is inferred automatically.
- `torch_dtype`: Default inference dtype (BEI-Bert: always `fp16`; BEI: `float16` or `bfloat16`)
  - Note: Pre-quantized loading is not supported, so weights must be `float16`, `bfloat16`, or `float32` for all engines.
- `quant_config`: Not allowed, since pre-quantized weights are not supported.
Model weights
- `model.safetensors` (preferred)
- Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded)
- Note: Convert to safetensors if you encounter issues with other formats
Tokenizer files
- `tokenizer_config.json` and `tokenizer.json`
- Must be "fast" tokenizers compatible with the Rust tokenizers implementation
- Custom Python tokenizer code is not supported and is ignored
Embedding model files (sentence-transformers)
- `1_Pooling/config.json`: Required for embedding models to define the pooling strategy
- Shows the available pooling layers and configurations
Pooling layer support
| Engine | Classification Layers | Pooling Types | Notes |
|---|---|---|---|
| BEI | 1 layer maximum | Last token, first token | Limited pooling options |
| BEI-Bert | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support |
Complete configuration examples
BEI with FP8 quantization (embedding model)
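A sketch of what such a config.yaml could look like, assuming the schema fields discussed above; the model repository and resource values are illustrative, not a canonical example:

```yaml
model_name: bei-embedding-fp8
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: intfloat/e5-mistral-7b-instruct   # illustrative causal embedding model
    max_num_tokens: 16384
    quantization_type: fp8
```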
BEI-Bert for small BERT model
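A comparable sketch for a small BERT-style model (repository and accelerator are illustrative):

```yaml
model_name: bei-bert-small
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: sentence-transformers/all-MiniLM-L6-v2   # illustrative small BERT model
    quantization_type: no_quant
```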
BEI for reranking model
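A sketch for a causal reranker served through the rerank endpoint; the runtime route key is an assumption:

```yaml
model_name: bei-reranker
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: BAAI/bge-reranker-v2-gemma   # illustrative causal reranking model
    quantization_type: fp8
  runtime:
    webserver_default_route: /rerank     # assumed field name
```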
BEI-Bert for classification model
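A sketch for a BERT-based classifier; the checkpoint must ship an `id2label` mapping in its config.json, and the runtime route key is again an assumption:

```yaml
model_name: bei-bert-classifier
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: SamLowe/roberta-base-go_emotions   # illustrative classifier with id2label
    quantization_type: no_quant
  runtime:
    webserver_default_route: /predict          # assumed field name
```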
Validation and troubleshooting
Common configuration errors
Error: `encoder does not have a kv-cache, therefore a kv specfic datatype is not valid`
- Cause: Using KV quantization (fp8_kv, fp4_kv) with encoder models
- Fix: Use `fp8` or `no_quant` instead
Error: `FP8 quantization is only supported on L4, H100, H200, B200`
- Cause: Using FP8 quantization on an unsupported GPU.
- Fix: Use an H100 or newer GPU, or use `no_quant`.
Error: `FP4 quantization is only supported on B200`
- Cause: Using FP4 quantization on an unsupported GPU.
- Fix: Use a B200 GPU or FP8 quantization.
Performance tuning
For maximum throughput (combined in the sketch after these lists):
- Use `max_num_tokens: 16384` for BEI.
- Enable FP8 quantization on supported hardware.
- Use `batch_scheduler_policy: max_utilization` for high load.

For lower latency:
- Use a smaller `max_num_tokens` suited to your use case.
- Use `batch_scheduler_policy: guaranteed_no_evict`.
- Consider BEI-Bert for small models with cold-start optimization.

For cost efficiency:
- Use L4 GPUs with FP8 quantization.
- Use BEI-Bert for small models.
- Tune `max_num_tokens` to your actual requirements.
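Putting the throughput-oriented settings together, a hedged sketch (placing `batch_scheduler_policy` under `runtime` is an assumption about the schema):

```yaml
trt_llm:
  build:
    max_num_tokens: 16384
    quantization_type: fp8
  runtime:
    batch_scheduler_policy: max_utilization   # or guaranteed_no_evict for latency stability
```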
Migration from older configurations
If you're migrating from older BEI configurations:
- Update `base_model`: Change from specific model types to `encoder` or `encoder_bert`
- Add `checkpoint_repository`: Use the new structured repository configuration
- Review quantization: Ensure quantization type matches hardware capabilities
- Update engine: Add engine configuration for better performance