trt_llm section in config.yaml.
Configuration structure
Build configuration
Fields are tagged Required, Optional, or Computed. Computed fields are set by the engine; do not configure them manually. The `build` section configures model compilation and optimization settings.
Required. The base model architecture determines which BEI variant to use.
Options:
- `encoder`: BEI - for causal embedding models (Llama, Mistral, Qwen, Gemma)
- `encoder_bert`: BEI-Bert - for BERT-based models (BERT, RoBERTa, Jina, Nomic)
Required. Specifies where to find the model checkpoint. The repository must follow the standard HuggingFace structure.
Source options:
- `HF`: Hugging Face Hub (default)
- `GCS`: Google Cloud Storage
- `S3`: AWS S3
- `AZURE`: Azure Blob Storage
- `REMOTE_URL`: HTTP URL to tar.gz file
- `BASETEN_TRAINING`: Baseten Training checkpoints
`checkpoint_repository` is the weight source for BEI models, and Baseten mirrors it to the Baseten Delivery Network automatically for fast cold starts. Don’t add a top-level `weights:` section to a BEI config: `checkpoint_repository` already handles weight loading, so using `weights:` directly is discouraged.

Optional. Maximum number of tokens that can be processed in a single batch. BEI defaults to 16384; BEI-Bert defaults to 8192. BEI and BEI-Bert run without chunked-prefill for performance reasons. This limits the effective context length to the `max_position_embeddings` value.
Range: 64 to 131072, must be a multiple of 64. Use higher values (up to 131072) for long context models. Most models use 16384 as default.

Computed. Not supported for BEI engines. Leave this value unset. BEI automatically sets it and truncates if context length is exceeded.
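The range rule for the token budget can be checked locally before a build is submitted. The following is a minimal sketch: `validate_max_tokens` is a hypothetical helper, not part of any Baseten tooling, and it only encodes the limits stated in this section.

```python
def validate_max_tokens(value: int) -> int:
    """Return `value` unchanged if it satisfies the documented limits.

    Limits taken from this section: 64 <= value <= 131072 and
    value must be a multiple of 64. Hypothetical helper for illustration.
    """
    if not 64 <= value <= 131072:
        raise ValueError(f"{value} is outside the range 64 to 131072")
    if value % 64 != 0:
        raise ValueError(f"{value} is not a multiple of 64")
    return value


print(validate_max_tokens(16384))  # the documented BEI default
print(validate_max_tokens(8192))   # the documented BEI-Bert default
```

A value such as 1000 is inside the range but is rejected because it is not a multiple of 64.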
Optional. Specifies the quantization format for model weights. FP8 quantization maintains accuracy within 1% of FP16 for embedding models.
Options for BEI:
- `no_quant`: FP16/BF16 precision
- `fp8`: FP8 weights + 16-bit KV cache
- `fp4`: FP4 weights + 16-bit KV cache (B200 only)
- `fp4_mlp_only`: FP4 MLP weights only (B200 only)

Options for BEI-Bert:
- `no_quant`: FP16 precision (only option)
Optional. Configuration for post-training quantization calibration.
Fields:
- `calib_size`: Size of calibration dataset (64-16384, multiple of 64)
- `calib_dataset`: HuggingFace dataset for calibration
- `calib_max_seq_length`: Maximum sequence length for calibration
Computed. BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported.
Automatic optimizations:
- XQA kernels for maximum throughput
- Dynamic batching for optimal utilization
- Memory-efficient attention mechanisms
- Hardware-specific optimizations
Runtime configuration
The `runtime` section configures serving behavior.
Optional. The default API endpoint for the deployment.
Options:
- `/v1/embeddings`: OpenAI-compatible embeddings endpoint
- `/rerank`: Reranking endpoint
- `/predict`: Classification/prediction endpoint

Embedding models default to `/v1/embeddings`. Classification models default to `/predict`.

Computed. Available but has no effect for BEI embedding models, which do not use a KV cache. Only relevant for generative (decoder) models.
Computed. Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.
Computed. Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.
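Because the `/v1/embeddings` route listed above is described as OpenAI-compatible, a client request can be sketched with the standard OpenAI embeddings body (`model` and `input`). The host and model name below are placeholders, authentication headers are deployment-specific and omitted, and `build_embeddings_request` is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, model: str, texts: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-style embeddings route."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and model name, for illustration only.
req = build_embeddings_request("https://example.invalid", "my-embedding-model", ["hello world"])
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here because it needs a real deployment URL and credentials.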
HuggingFace model repository structure
All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, as they are in a repository cloned from the Hugging Face Hub.

Model configuration
`config.json`
- `max_position_embeddings`: Limits maximum context size (content beyond this is truncated)
- `id2label`: Required dictionary mapping IDs to labels for classification models.
  - Note: Its length must match the output size of the last dense layer. Each dense output needs a name for the JSON response.
- `architecture`: Must be `ModelForSequenceClassification` or similar (cannot be `ForCausalLM`)
  - Note: Remote code execution is not supported; architecture is inferred automatically
- `torch_dtype`: Default inference dtype (BEI-Bert: always `fp16`, BEI: `float16`, `bfloat16`)
  - Note: Loading pre-quantized weights is not supported, so your weights need to be `float16`, `bfloat16` or `float32` for all engines.
- `quant_config`: Not allowed, because pre-quantized weights are not supported.
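The `config.json` rules above can be turned into a quick local check. This is a minimal sketch under stated assumptions: `check_model_config` is a hypothetical helper, the architecture is read from the `architectures` list that Hugging Face `config.json` files normally carry, and `quant_config` is the key name used in this section.

```python
ALLOWED_DTYPES = {"float16", "bfloat16", "float32"}


def check_model_config(config: dict, classification: bool = False) -> list:
    """Return human-readable problems found in a parsed config.json."""
    problems = []

    dtype = config.get("torch_dtype")
    if dtype is not None and dtype not in ALLOWED_DTYPES:
        problems.append(f"torch_dtype {dtype!r} is not float16, bfloat16 or float32")

    # Pre-quantized checkpoints are not supported, so this key must be absent.
    if "quant_config" in config:
        problems.append("quant_config is present but pre-quantized weights are not supported")

    if classification:
        # Assumption: architecture names live in the "architectures" list.
        if any(name.endswith("ForCausalLM") for name in config.get("architectures", [])):
            problems.append("classification models cannot use a ForCausalLM architecture")
        if not config.get("id2label"):
            problems.append("id2label is required for classification models")

    return problems


good = {
    "max_position_embeddings": 512,
    "torch_dtype": "float16",
    "architectures": ["BertForSequenceClassification"],
    "id2label": {"0": "negative", "1": "positive"},
}
print(check_model_config(good, classification=True))
```

The check does not compare `id2label` against the last dense layer, because that requires inspecting the weights.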
Model weights
- `model.safetensors` (preferred)
- Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded)
- Note: Convert to safetensors if you encounter issues with other formats
Tokenizer files
`tokenizer_config.json` and `tokenizer.json`
- Must be “FAST” tokenizers compatible with Rust
- Typically cannot contain custom Python code; any such code is not read.
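The file requirements above can be verified on a local checkout before uploading it to any of the supported sources. The sketch below assumes a hypothetical helper, `missing_repo_files`, that looks only at the repository root; the optional `embedding_model` flag also checks for `1_Pooling/config.json`.

```python
import tempfile
from pathlib import Path


def missing_repo_files(repo_dir, embedding_model: bool = False) -> list:
    """List expected files that are absent from the repository root."""
    root = Path(repo_dir)
    required = ("config.json", "tokenizer_config.json", "tokenizer.json")
    missing = [name for name in required if not (root / name).is_file()]

    # Either a single safetensors file or a sharded index is acceptable.
    has_single = (root / "model.safetensors").is_file()
    has_sharded = (root / "model.safetensors.index.json").is_file()
    if not (has_single or has_sharded):
        missing.append("model.safetensors (or model.safetensors.index.json)")

    if embedding_model and not (root / "1_Pooling" / "config.json").is_file():
        missing.append("1_Pooling/config.json")
    return missing


# Demonstration on a throwaway directory containing only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "tokenizer.json"):
        (Path(tmp) / name).write_text("{}")
    print(missing_repo_files(tmp))
```

The helper only checks for presence; it does not validate shard names or file contents.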
Embedding model files (sentence-transformers)
- `1_Pooling/config.json`: Required for embedding models to define pooling strategy
- `modules.json`: Required for embedding models; shows available pooling layers and configurations

The engine reads `modules.json` and `1_Pooling/config.json` and maps the pooling configuration to one of the modes below.
| Flag in `1_Pooling/config.json` | Pooling mode | BEI | BEI-Bert |
|---|---|---|---|
| `pooling_mode_cls_token: true` | CLS token (first token) | ✅ | ✅ |
| `pooling_mode_mean_tokens: true` | Mean tokens | ✅ | ✅ |
| `pooling_mode_lasttoken: true` | Last token | ✅ | ✅ |
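The flag-to-mode mapping in the table can be expressed as a small lookup. This is an illustrative sketch, not the engine's actual logic: `pooling_mode` is a hypothetical helper, and requiring exactly one enabled flag is an assumption made here for clarity.

```python
import json

# Flags and mode names copied from the table above.
FLAG_TO_MODE = {
    "pooling_mode_cls_token": "CLS token (first token)",
    "pooling_mode_mean_tokens": "Mean tokens",
    "pooling_mode_lasttoken": "Last token",
}


def pooling_mode(pooling_config: dict) -> str:
    """Return the pooling mode selected by a parsed 1_Pooling/config.json."""
    enabled = [mode for flag, mode in FLAG_TO_MODE.items() if pooling_config.get(flag) is True]
    if len(enabled) != 1:
        # Assumption: exactly one supported flag should be set to true.
        raise ValueError(f"expected exactly one supported pooling flag, found {len(enabled)}")
    return enabled[0]


raw = '{"word_embedding_dimension": 384, "pooling_mode_cls_token": true, "pooling_mode_mean_tokens": false}'
print(pooling_mode(json.loads(raw)))
```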
Pooling layer support
| Engine | Classification Layers | Pooling Types | Notes |
|---|---|---|---|
| BEI | 1 layer maximum | Last token, first token | Limited pooling options |
| BEI-Bert | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support |
Throughput benchmarks
Measured against TEI and vLLM on the same hardware. Token throughput uses 500 tokens per request; request throughput uses 5 tokens per request. For the full methodology, see Run Qwen3 Embedding on NVIDIA Blackwell GPUs.

| Framework | Precision | GPU | Max tokens/s | Max requests/s |
|---|---|---|---|---|
| TEI | FP16 | H100 | 34,055 | 824.25 |
| BEI-Bert | FP16 | H100 | 36,520 | 841.05 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |
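To read the token-throughput column as relative speedups, each row can be divided by a baseline row. The sketch below uses only the numbers from the table; choosing vLLM BF16 on H100 as the baseline is an assumption made for illustration.

```python
# (framework, precision, gpu) -> max tokens/s, copied from the table above
TOKENS_PER_S = {
    ("TEI", "FP16", "H100"): 34055,
    ("BEI-Bert", "FP16", "H100"): 36520,
    ("vLLM", "BF16", "H100"): 36625,
    ("BEI", "BF16", "H100"): 47549,
    ("BEI", "FP8", "H100"): 77107,
    ("BEI", "FP8", "B200"): 121443,
}


def speedup(row, baseline=("vLLM", "BF16", "H100")) -> float:
    """Token throughput of `row` relative to `baseline`."""
    return TOKENS_PER_S[row] / TOKENS_PER_S[baseline]


for row in TOKENS_PER_S:
    print(row, f"{speedup(row):.2f}x")
```

On H100 this gives roughly 1.30x for BEI BF16 and 2.11x for BEI FP8 relative to vLLM, which is in line with the 1.3x and 2x figures in the Quantization impact table.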
Quantization impact
| Quantization | Speed improvement | Memory reduction | Accuracy impact |
|---|---|---|---|
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x | None | None |
| FP8 BEI | 2x | 50% | ~1% |
| FP4 BEI | 3.5x | 75% | 1-2% |
Hardware support
| GPU | BEI | BEI-Bert | Recommended for |
|---|---|---|---|
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Full | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |
Complete configuration examples
BEI with FP8 quantization (embedding model)
BEI-Bert for small BERT model
BEI for reranking model
BEI-Bert for classification model
BEI-Bert for code embeddings (Jina)
BEI-Bert for bidirectional Qwen2 (long sequences)
Common configuration errors
Error: `encoder does not have a kv-cache, therefore a kv specific datatype is not valid`
- Cause: Using KV quantization (`fp8_kv`, `fp4_kv`) with encoder models.
- Fix: Use `fp8` or `no_quant` instead.

Error: `FP8 quantization is only supported on L4, H100, H200, B200`
- Cause: Using FP8 quantization on an unsupported GPU.
- Fix: Use a supported GPU (L4, H100, H200, or B200), or use `no_quant`.

Error: `FP4 quantization is only supported on B200`
- Cause: Using FP4 quantization on an unsupported GPU.
- Fix: Use a B200 GPU, or use FP8 quantization.
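The three errors above can be anticipated with a pre-flight check. The following is a minimal sketch: `quantization_errors` is a hypothetical helper, the GPU lists come from the error messages in this section, and the real build may perform additional checks not modeled here.

```python
FP8_GPUS = {"L4", "H100", "H200", "B200"}
FP4_GPUS = {"B200"}


def quantization_errors(quantization: str, gpu: str) -> list:
    """Return the documented error messages a quantization/GPU pair would trigger."""
    errors = []
    if quantization.endswith("_kv"):
        # Encoder models have no KV cache, so KV-specific datatypes are rejected.
        errors.append(
            "encoder does not have a kv-cache, therefore a kv specific datatype is not valid"
        )
    if quantization.startswith("fp8") and gpu not in FP8_GPUS:
        errors.append("FP8 quantization is only supported on L4, H100, H200, B200")
    if quantization.startswith("fp4") and gpu not in FP4_GPUS:
        errors.append("FP4 quantization is only supported on B200")
    return errors


print(quantization_errors("fp8", "H100"))  # no errors expected
print(quantization_errors("fp8", "A100"))
print(quantization_errors("fp4_mlp_only", "H100"))
```

An empty list means none of the documented errors apply to that combination.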