trt_llm section in config.yaml.
Configuration structure
Build configuration
The `build` section configures model compilation and optimization settings.
The base model architecture determines which BEI variant to use. Options:
- `encoder`: BEI, for causal embedding models (Llama, Mistral, Qwen, Gemma)
- `encoder_bert`: BEI-Bert, for BERT-based models (BERT, RoBERTa, Jina, Nomic)
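As a minimal sketch (assuming the `trt_llm.build` schema described in this section; the commented alternative is illustrative), selecting the variant looks like this:

```yaml
trt_llm:
  build:
    base_model: encoder        # BEI: causal embedding models (Llama, Mistral, Qwen, Gemma)
    # base_model: encoder_bert # BEI-Bert: BERT-based models (BERT, RoBERTa, Jina, Nomic)
```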
Specifies where to find the model checkpoint. The repository must follow the standard HuggingFace structure. Options:
- `HF`: Hugging Face Hub (default)
- `GCS`: Google Cloud Storage
- `S3`: AWS S3
- `AZURE`: Azure Blob Storage
- `REMOTE_URL`: HTTP URL to a tar.gz file
- `BASETEN_TRAINING`: Baseten Training checkpoints
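For example, pulling a checkpoint from the Hugging Face Hub could look like the sketch below (the repository id is illustrative; verify key names against the `trt_llm` build schema):

```yaml
trt_llm:
  build:
    checkpoint_repository:
      source: HF                     # or GCS, S3, AZURE, REMOTE_URL, BASETEN_TRAINING
      repo: BAAI/bge-base-en-v1.5    # illustrative repository id
```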
Maximum number of tokens that can be processed in a single batch. BEI and BEI-Bert run without chunked prefill for performance reasons, which limits the effective context length to the `max_position_embeddings` value.

Range: 64 to 131072; must be a multiple of 64. Use higher values (up to 131072) for long-context models. Most models use 16384 as the default.

Not supported for BEI engines; leave this value unset. BEI sets it automatically and truncates input if the context length is exceeded.
Specifies the quantization format for model weights.
FP8 quantization maintains accuracy within 1% of FP16 for embedding models.

Options for BEI:
- `no_quant`: FP16/BF16 precision
- `fp8`: FP8 weights + 16-bit KV cache
- `fp4`: FP4 weights + 16-bit KV cache (B200 only)
- `fp4_mlp_only`: FP4 MLP weights only (B200 only)

Options for BEI-Bert:
- `no_quant`: FP16 precision (only option)
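A hedged sketch of the two cases, assuming the quantization field is named `quantization_type` (check this against your schema):

```yaml
trt_llm:
  build:
    base_model: encoder
    quantization_type: fp8   # BEI options: no_quant, fp8, fp4, fp4_mlp_only
    # For base_model: encoder_bert (BEI-Bert), only no_quant is available.
```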
Configuration for post-training quantization calibration. Fields:
- `calib_size`: Size of the calibration dataset (64-16384, multiple of 64)
- `calib_dataset`: HuggingFace dataset used for calibration
- `calib_max_seq_length`: Maximum sequence length for calibration
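A sketch of a calibration block, assuming these fields sit under a `calibration_config` key in the build section (the parent key name and dataset id are assumptions, not confirmed by this document):

```yaml
trt_llm:
  build:
    quantization_type: fp8
    calibration_config:             # parent key name is an assumption
      calib_size: 1024              # 64-16384, multiple of 64
      calib_dataset: cnn_dailymail  # illustrative HuggingFace dataset
      calib_max_seq_length: 2048
```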
BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported. Automatic optimizations:
- XQA kernels for maximum throughput
- Dynamic batching for optimal utilization
- Memory-efficient attention mechanisms
- Hardware-specific optimizations
Runtime configuration
The `runtime` section configures serving behavior.
The default API endpoint for the deployment. Options:
- `/v1/embeddings`: OpenAI-compatible embeddings endpoint
- `/rerank`: Reranking endpoint
- `/predict`: Classification/prediction endpoint
Embedding models default to `/v1/embeddings`; classification models default to `/predict`.

The remaining runtime options are not applicable to BEI engines and are only used for generative models.
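A minimal runtime sketch; the route field name below is an assumption and should be checked against the actual runtime schema:

```yaml
trt_llm:
  runtime:
    webserver_default_route: /v1/embeddings   # assumed field name; or /rerank, /predict
```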
HuggingFace Model Repository Structure
All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, in the same layout a standard Hugging Face download produces.
Model configuration
`config.json`
- `max_position_embeddings`: Limits the maximum context size (content beyond this is truncated)
- `id2label`: Required dictionary mapping IDs to labels for classification models
  - Note: Its length must match the shape of the last dense layer; each dense output needs a name for the JSON response.
- `architecture`: Must be `ModelForSequenceClassification` or similar (cannot be `ForCausalLM`)
  - Note: Remote code execution is not supported; the architecture is inferred automatically.
- `torch_dtype`: Default inference dtype (BEI-Bert: always `fp16`; BEI: `float16` or `bfloat16`)
  - Note: Pre-quantized loading is not supported, so weights must be `float16`, `bfloat16`, or `float32` for all engines.
- `quant_config`: Not allowed, since pre-quantized weights are not supported.
Model weights
- `model.safetensors` (preferred)
- Or: `model.safetensors.index.json` + `model-xx-of-yy.safetensors` (sharded)
- Note: Convert to safetensors if you encounter issues with other formats
Tokenizer files
- `tokenizer_config.json` and `tokenizer.json`
- Must be "fast" tokenizers compatible with the Rust tokenizers implementation
- Custom Python tokenizer code is not supported and is ignored
Embedding model files (sentence-transformers)
- `1_Pooling/config.json`: Required for embedding models to define the pooling strategy
- Shows the available pooling layers and configurations
Pooling layer support
| Engine | Classification Layers | Pooling Types | Notes |
|---|---|---|---|
| BEI | 1 layer maximum | Last token, first token | Limited pooling options |
| BEI-Bert | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support |
Complete configuration examples
BEI with FP8 quantization (embedding model)
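A sketch of what such a config.yaml could look like, assuming the schema fields discussed above; the model repository and resource values are illustrative, not a canonical example:

```yaml
model_name: bei-embedding-fp8
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: intfloat/e5-mistral-7b-instruct   # illustrative causal embedding model
    max_num_tokens: 16384
    quantization_type: fp8
```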
BEI-Bert for small BERT model
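A comparable sketch for a small BERT-style model (repository and accelerator are illustrative):

```yaml
model_name: bei-bert-small
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: sentence-transformers/all-MiniLM-L6-v2   # illustrative small BERT model
    quantization_type: no_quant
```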
BEI for reranking model
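A sketch for a causal reranker served through the rerank endpoint; the runtime route key is an assumption:

```yaml
model_name: bei-reranker
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: BAAI/bge-reranker-v2-gemma   # illustrative causal reranking model
    quantization_type: fp8
  runtime:
    webserver_default_route: /rerank     # assumed field name
```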
BEI-Bert for classification model
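A sketch for a BERT-based classifier; the checkpoint must ship an `id2label` mapping in its config.json, and the runtime route key is again an assumption:

```yaml
model_name: bei-bert-classifier
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: SamLowe/roberta-base-go_emotions   # illustrative classifier with id2label
    quantization_type: no_quant
  runtime:
    webserver_default_route: /predict          # assumed field name
```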
Validation and troubleshooting
Common configuration errors
Error: `encoder does not have a kv-cache, therefore a kv specfic datatype is not valid`
- Cause: Using KV quantization (fp8_kv, fp4_kv) with encoder models
- Fix: Use `fp8` or `no_quant` instead
Error: `FP8 quantization is only supported on L4, H100, H200, B200`
- Cause: Using FP8 quantization on an unsupported GPU.
- Fix: Use an H100 or newer GPU, or use `no_quant`.
Error: `FP4 quantization is only supported on B200`
- Cause: Using FP4 quantization on an unsupported GPU.
- Fix: Use a B200 GPU or FP8 quantization.
Performance tuning
For maximum throughput (combined in the sketch after these lists):
- Use `max_num_tokens: 16384` for BEI.
- Enable FP8 quantization on supported hardware.
- Use `batch_scheduler_policy: max_utilization` for high load.

For lower latency:
- Use a smaller `max_num_tokens` suited to your use case.
- Use `batch_scheduler_policy: guaranteed_no_evict`.
- Consider BEI-Bert for small models with cold-start optimization.

For cost efficiency:
- Use L4 GPUs with FP8 quantization.
- Use BEI-Bert for small models.
- Tune `max_num_tokens` to your actual requirements.
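Putting the throughput-oriented settings together, a hedged sketch (placing `batch_scheduler_policy` under `runtime` is an assumption about the schema):

```yaml
trt_llm:
  build:
    max_num_tokens: 16384
    quantization_type: fp8
  runtime:
    batch_scheduler_policy: max_utilization   # or guaranteed_no_evict for latency stability
```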
Migration from older configurations
If you're migrating from older BEI configurations:
- Update `base_model`: Change from specific model types to `encoder` or `encoder_bert`
- Add `checkpoint_repository`: Use the new structured repository configuration
- Review quantization: Ensure quantization type matches hardware capabilities
- Update engine: Add engine configuration for better performance