This reference covers all configuration options for BEI and BEI-Bert deployments. All settings use the trt_llm section in config.yaml.

Configuration structure

trt_llm:
  inference_stack: v1  # Always v1 for BEI
  build:
    base_model: encoder | encoder_bert
    checkpoint_repository: {...}
    max_num_tokens: 16384
    quantization_type: no_quant | fp8 | fp4 | fp4_mlp_only
    quantization_config: {...}
    plugin_configuration: {...}
  runtime:
    webserver_default_route: /v1/embeddings | /rerank | /predict

Build configuration

The build section configures model compilation and optimization settings. Fields are tagged Required, Optional, or Computed. Computed fields are set by the engine; do not configure them manually.
base_model
string
required
Required. The base model architecture determines which BEI variant to use.
Options:
  • encoder: BEI - for causal embedding models (Llama, Mistral, Qwen, Gemma)
  • encoder_bert: BEI-Bert - for BERT-based models (BERT, RoBERTa, Jina, Nomic)
build:
  base_model: encoder
checkpoint_repository
object
required
Required. Specifies where to find the model checkpoint. The repository must follow the standard HuggingFace structure.
Source options:
  • HF: Hugging Face Hub (default)
  • GCS: Google Cloud Storage
  • S3: AWS S3
  • AZURE: Azure Blob Storage
  • REMOTE_URL: HTTP URL to tar.gz file
  • BASETEN_TRAINING: Baseten Training checkpoints
For training checkpoint deployment, see Deploy with optimized inference engines. For cloud storage sources (GCS, S3, Azure), see Deploy from cloud storage.
checkpoint_repository:
  source: HF
  repo: "BAAI/bge-large-en-v1.5"
  revision: main
  runtime_secret_name: hf_access_token  # Optional, for private repos
checkpoint_repository is the weight source for BEI models, and Baseten mirrors it to the Baseten Delivery Network automatically for fast cold starts. Don’t add a top-level weights: section to a BEI config: checkpoint_repository already handles weight loading, so using weights: directly is discouraged.
max_num_tokens
number
default:"16384 (BEI) / 8192 (BEI-Bert)"
Optional. Maximum number of tokens that can be processed in a single batch. BEI defaults to 16384; BEI-Bert defaults to 8192. BEI and BEI-Bert run without chunked prefill for performance reasons. This limits the effective context length to the max_position_embeddings value.
Range: 64 to 131072, in multiples of 64. Use higher values (up to 131072) for long-context models.
build:
  max_num_tokens: 16384
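The range rule above (64 to 131072, in multiples of 64) can be checked before a build is submitted. The following is a minimal sketch, not part of BEI itself; the function names `validate_max_num_tokens` and `round_up_to_valid` are hypothetical helpers introduced for illustration.

```python
# Hypothetical pre-flight check for build.max_num_tokens.
# The limits below are taken from the documented range; the helper names are assumptions.
MIN_TOKENS = 64
MAX_TOKENS = 131072
STEP = 64


def validate_max_num_tokens(value: int) -> int:
    """Return value unchanged if it is a legal max_num_tokens, else raise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("max_num_tokens must be an integer")
    if not MIN_TOKENS <= value <= MAX_TOKENS:
        raise ValueError(
            f"max_num_tokens must be between {MIN_TOKENS} and {MAX_TOKENS}"
        )
    if value % STEP != 0:
        raise ValueError(f"max_num_tokens must be a multiple of {STEP}")
    return value


def round_up_to_valid(tokens_needed: int) -> int:
    """Round a desired token budget up to the nearest legal value."""
    rounded = -(-tokens_needed // STEP) * STEP  # ceiling to a multiple of 64
    return min(MAX_TOKENS, max(MIN_TOKENS, rounded))
```

For example, a desired budget of 1000 tokens rounds up to 1024, which passes validation.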
max_seq_len
number
Computed. Not supported for BEI engines. Leave this value unset. BEI automatically sets it and truncates if context length is exceeded.
quantization_type
string
default:"no_quant"
Optional. Specifies the quantization format for model weights. FP8 quantization maintains accuracy within 1% of FP16 for embedding models.
Options for BEI:
  • no_quant: FP16/BF16 precision
  • fp8: FP8 weights + 16-bit KV cache
  • fp4: FP4 weights + 16-bit KV cache (B200 only)
  • fp4_mlp_only: FP4 MLP weights only (B200 only)
Options for BEI-Bert:
  • no_quant: FP16 precision (only option)
For detailed quantization guidance, see Quantization guide.
build:
  quantization_type: fp8
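The per-engine option lists above can be expressed as a small lookup. This is an illustrative sketch only: `ALLOWED_QUANTIZATION` and `check_quantization` are hypothetical names, and the real build may validate differently.

```python
# Allowed quantization_type values per base_model, as listed in this reference.
# The table and function names are assumptions made for illustration.
ALLOWED_QUANTIZATION = {
    "encoder": {"no_quant", "fp8", "fp4", "fp4_mlp_only"},  # BEI
    "encoder_bert": {"no_quant"},                           # BEI-Bert
}


def check_quantization(base_model: str, quantization_type: str = "no_quant") -> None:
    """Raise ValueError if the combination is not listed in this reference."""
    if base_model not in ALLOWED_QUANTIZATION:
        raise ValueError(f"unknown base_model: {base_model!r}")
    allowed = ALLOWED_QUANTIZATION[base_model]
    if quantization_type not in allowed:
        raise ValueError(
            f"{quantization_type!r} is not valid for {base_model}; "
            f"choose one of {sorted(allowed)}"
        )
```

Calling `check_quantization("encoder_bert", "fp8")` raises, because BEI-Bert only lists no_quant.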
quantization_config
object
Optional. Configuration for post-training quantization calibration.
Fields:
  • calib_size: Size of calibration dataset (64-16384, multiple of 64)
  • calib_dataset: HuggingFace dataset for calibration
  • calib_max_seq_length: Maximum sequence length for calibration
quantization_config:
  calib_size: 1024
  calib_dataset: "abisee/cnn_dailymail"
  calib_max_seq_length: 1536
plugin_configuration
object
Computed. BEI automatically configures optimal TensorRT-LLM plugin settings. Manual configuration is not required or supported.
Automatic optimizations:
  • XQA kernels for maximum throughput
  • Dynamic batching for optimal utilization
  • Memory-efficient attention mechanisms
  • Hardware-specific optimizations
Note: Manual plugin configuration is only available for the Engine-Builder-LLM engine.

Runtime configuration

The runtime section configures serving behavior.
webserver_default_route
string
default:"/v1/embeddings"
Optional. The default API endpoint for the deployment.
Options:
  • /v1/embeddings: OpenAI-compatible embeddings endpoint
  • /rerank: Reranking endpoint
  • /predict: Classification/prediction endpoint
BEI automatically detects embedding models and sets /v1/embeddings. Classification models default to /predict.
runtime:
  webserver_default_route: /v1/embeddings
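As a rough illustration of what a client sends to the OpenAI-compatible route, the sketch below builds (but does not send) a POST request using only the Python standard library. The base URL, the `Api-Key` authorization scheme, and the `model` value are assumptions for illustration; check your deployment's actual URL and authentication scheme before using it.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, api_key: str, texts: list,
                             model: str = "my-embedding-model") -> urllib.request.Request:
    """Build a POST request for the /v1/embeddings route without sending it.

    base_url, api_key, and model are placeholders (assumptions), not values
    taken from this reference.
    """
    payload = {"model": model, "input": texts}  # OpenAI-style embeddings body
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth header format; adjust to your deployment.
            "Authorization": f"Api-Key {api_key}",
        },
        method="POST",
    )


request = build_embeddings_request(
    "https://example.invalid/deployment",  # hypothetical base URL
    "YOUR_API_KEY",
    ["first sentence", "second sentence"],
)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without network access.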
kv_cache_free_gpu_mem_fraction
number
Computed. Available but has no effect for BEI embedding models, which do not use a KV cache. Only relevant for generative (decoder) models.
enable_chunked_context
boolean
Computed. Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.
batch_scheduler_policy
string
Computed. Available but has no effect for BEI embedding models. Only relevant for generative (decoder) models.

HuggingFace model repository structure

All model sources (S3, GCS, HuggingFace, or tar.gz) must follow the standard HuggingFace repository structure. Files must be in the root directory, similar to running:
git clone https://huggingface.co/michaelfeil/bge-small-en-v1.5

Model configuration

config.json
  • max_position_embeddings: Limits the maximum context size (content beyond this is truncated).
  • id2label: Required dictionary mapping IDs to labels for classification models.
    • Note: Its length must match the output size of the last dense layer, because each dense output needs a name in the JSON response.
  • architecture: Must be ModelForSequenceClassification or similar (cannot be ForCausalLM).
    • Note: Remote code execution is not supported; the architecture is inferred automatically.
  • torch_dtype: Default inference dtype (BEI-Bert: always float16; BEI: float16 or bfloat16).
    • Note: Pre-quantized loading is not supported, so your weights must be float16, bfloat16, or float32 for all engines.
  • quant_config: Not allowed, because pre-quantized weights are not supported.
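Some of the config.json rules above can be checked locally before uploading a checkpoint. The sketch below is a hypothetical helper (`check_model_config` is not a BEI API) that covers only the rules stated in this list: no quant_config, a supported torch_dtype, and an id2label mapping whose length matches the classifier output size.

```python
# Hypothetical local check of a parsed config.json (a plain dict).
# It mirrors only the rules listed above; the real build may check more.
SUPPORTED_DTYPES = {"float16", "bfloat16", "float32"}


def check_model_config(config: dict, is_classifier: bool = False,
                       num_outputs=None) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if "quant_config" in config:
        problems.append("quant_config is not allowed: pre-quantized weights are not supported")
    dtype = config.get("torch_dtype")
    if dtype is not None and dtype not in SUPPORTED_DTYPES:
        problems.append(f"torch_dtype {dtype!r} is not one of {sorted(SUPPORTED_DTYPES)}")
    if is_classifier:
        id2label = config.get("id2label")
        if not isinstance(id2label, dict) or not id2label:
            problems.append("id2label is required for classification models")
        elif num_outputs is not None and len(id2label) != num_outputs:
            problems.append(
                f"id2label has {len(id2label)} entries but the last dense layer has {num_outputs} outputs"
            )
    return problems
```

`num_outputs` is supplied by the caller here because the output size of the last dense layer lives in the weights, not in config.json.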

Model weights

model.safetensors (preferred)
  • Or: model.safetensors.index.json + model-xx-of-yy.safetensors (sharded)
  • Note: Convert to safetensors if you encounter issues with other formats

Tokenizer files

tokenizer_config.json and tokenizer.json
  • Must be “fast” tokenizers compatible with the Rust tokenizers implementation.
  • Custom Python tokenizer code is typically not supported and will not be read.
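The weight and tokenizer file requirements above lend themselves to a quick local check of a checkpoint directory. This is a sketch under stated assumptions: `missing_root_files` is a hypothetical helper, and it relies on the standard `weight_map` key that Hugging Face writes into model.safetensors.index.json for sharded checkpoints.

```python
import json
from pathlib import Path

TOKENIZER_FILES = ("tokenizer_config.json", "tokenizer.json")
WEIGHTS_HINT = "model.safetensors or model.safetensors.index.json"


def missing_root_files(repo_dir) -> list:
    """List expected root-level files that are absent from a checkpoint directory."""
    root = Path(repo_dir)
    missing = [name for name in TOKENIZER_FILES if not (root / name).is_file()]

    # Preferred layout: a single model.safetensors file.
    if (root / "model.safetensors").is_file():
        return missing

    # Sharded layout: the index maps each tensor name to its shard file.
    index = root / "model.safetensors.index.json"
    if index.is_file():
        weight_map = json.loads(index.read_text())["weight_map"]
        shards = sorted(set(weight_map.values()))
        missing += [shard for shard in shards if not (root / shard).is_file()]
        return missing

    missing.append(WEIGHTS_HINT)
    return missing
```

An empty result means the tokenizer files and at least one complete weight layout were found in the root directory.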

Embedding model files (sentence-transformers)

1_Pooling/config.json
  • Required for embedding models to define pooling strategy
modules.json
  • Required for embedding models
  • Shows available pooling layers and configurations
At build time, BEI reads pooling mode from modules.json and 1_Pooling/config.json and maps it to one of the modes below.
| Flag in 1_Pooling/config.json | Pooling mode |
| --- | --- |
| pooling_mode_cls_token: true | CLS token (first token) |
| pooling_mode_mean_tokens: true | Mean tokens |
| pooling_mode_lasttoken: true | Last token |
If either file is missing on an embedding checkpoint, the build fails with a clear error naming the missing path. Sequence classification and reranking models skip pooling detection and use the classification head instead.
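The flag-to-mode mapping described above can be sketched as a small function over the parsed contents of 1_Pooling/config.json. This is an illustration, not BEI's actual build logic: the function name and the short mode labels ("cls", "mean", "last_token") are assumptions, and only the three flags named in this section are handled.

```python
# Maps the documented flags to short mode labels (labels are hypothetical).
POOLING_FLAGS = {
    "pooling_mode_cls_token": "cls",          # CLS token (first token)
    "pooling_mode_mean_tokens": "mean",       # mean over tokens
    "pooling_mode_lasttoken": "last_token",   # last token
}


def detect_pooling_mode(pooling_config: dict) -> str:
    """Return the single enabled pooling mode, or raise if it is ambiguous."""
    enabled = [
        mode for flag, mode in POOLING_FLAGS.items()
        if pooling_config.get(flag) is True
    ]
    if len(enabled) != 1:
        raise ValueError(
            f"expected exactly one enabled pooling flag, found {len(enabled)}: {enabled}"
        )
    return enabled[0]
```

Raising when zero or several flags are enabled is a design choice of this sketch; it keeps an ambiguous config from silently picking a mode.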

Pooling layer support

| Engine | Classification layers | Pooling types | Notes |
| --- | --- | --- | --- |
| BEI | 1 layer maximum | Last token, first token | Limited pooling options |
| BEI-Bert | Multiple layers or 1 layer | Last token, first token, mean, SPLADE pooling | Advanced pooling support |

Throughput benchmarks

Measured against TEI and vLLM on the same hardware. Token throughput uses 500 tokens per request; request throughput uses 5 tokens per request. For the full methodology, see Run Qwen3 Embedding on NVIDIA Blackwell GPUs.
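| Framework | Precision | GPU | Max tokens/s | Max requests/s |
| --- | --- | --- | --- | --- |
| TEI | FP16 | H100 | 34,055 | 824.25 |
| BEI-Bert | FP16 | H100 | 36,520 | 841.05 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |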

Quantization impact

| Quantization | Speed improvement | Memory reduction | Accuracy impact |
| --- | --- | --- | --- |
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x | None | None |
| FP8 BEI | 2x | 50% | ~1% |
| FP4 BEI | 3.5x | 75% | 1-2% |
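The speed-improvement figures can be related back to the throughput benchmarks with simple ratios. The sketch below copies the benchmark rows from the table above and divides by the vLLM BF16 row; treating that row as the baseline is an assumption based on the "Baseline" label in the quantization table.

```python
# (framework, precision, gpu) -> (max tokens/s, max requests/s), copied from the benchmark table.
BENCHMARKS = {
    ("TEI", "FP16", "H100"): (34055, 824.25),
    ("BEI-Bert", "FP16", "H100"): (36520, 841.05),
    ("vLLM", "BF16", "H100"): (36625, 155.23),
    ("BEI", "BF16", "H100"): (47549, 761.44),
    ("BEI", "FP8", "H100"): (77107, 855.96),
    ("BEI", "FP8", "B200"): (121443, 1310.52),
}

BASELINE = ("vLLM", "BF16", "H100")  # assumed baseline row


def token_speedup(row, baseline=BASELINE) -> float:
    """Ratio of max tokens/s between a row and the baseline row."""
    return BENCHMARKS[row][0] / BENCHMARKS[baseline][0]


def request_speedup(row, baseline=BASELINE) -> float:
    """Ratio of max requests/s between a row and the baseline row."""
    return BENCHMARKS[row][1] / BENCHMARKS[baseline][1]


for row in BENCHMARKS:
    print(row, f"tokens: {token_speedup(row):.2f}x", f"requests: {request_speedup(row):.2f}x")
```

On H100, the token-throughput ratios come out to roughly 1.30x for BEI BF16 and 2.11x for BEI FP8, in line with the 1.3x and 2x rows. The benchmark table has no FP4 row, so the 3.5x figure cannot be reproduced from it.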

Hardware support

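| GPU | BEI | BEI-Bert | Recommended for |
| --- | --- | --- | --- |
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Full | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |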

Complete configuration examples

BEI with FP8 quantization (embedding model)

model_name: BEI-Qwen3-Embedding-8B-FP8
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen3-Embedding-8B"
      revision: main
    max_num_tokens: 16384
    quantization_type: fp8
    quantization_config:
      calib_size: 1536
      calib_dataset: "abisee/cnn_dailymail"
      calib_max_seq_length: 1536
    # plugin_configuration is auto-configured for BEI models.
    # Encoder models disable paged_kv_cache and use_paged_context_fmha automatically.
  runtime:
    webserver_default_route: /v1/embeddings

BEI-Bert for small BERT model

model_name: BEI-Bert-MiniLM-L6
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "sentence-transformers/all-MiniLM-L6-v2"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
    # plugin_configuration is auto-configured for BEI-Bert models.
    # paged_kv_cache and use_paged_context_fmha are disabled automatically.
  runtime:
    webserver_default_route: /v1/embeddings

BEI for reranking model

model_name: BEI-BGE-Reranker
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-reranker-large"
      revision: main
    max_num_tokens: 16384
    quantization_type: fp8
    quantization_config:
      calib_size: 1024
      calib_dataset: "abisee/cnn_dailymail"
      calib_max_seq_length: 2048
  runtime:
    webserver_default_route: /rerank

BEI-Bert for classification model

model_name: BEI-Bert-Language-Detection
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "papluca/xlm-roberta-base-language-detection"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
  runtime:
    webserver_default_route: /predict

BEI-Bert for code embeddings (Jina)

model_name: BEI-Bert-Jina-Code
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "jinaai/jina-embeddings-v2-base-code"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
    # The two settings below have no effect for BEI-Bert (no KV cache) and can be omitted.
    kv_cache_free_gpu_mem_fraction: 0.9
    batch_scheduler_policy: guaranteed_no_evict

BEI-Bert for bidirectional Qwen2 (long sequences)

model_name: BEI-Bert-GTE-Qwen-1.5B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "Alibaba-NLP/gte-Qwen2-1.5B-instruct"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
    # The two settings below have no effect for BEI-Bert (no KV cache) and can be omitted.
    kv_cache_free_gpu_mem_fraction: 0.85
    batch_scheduler_policy: guaranteed_no_evict

Common configuration errors

Error: encoder does not have a kv-cache, therefore a kv specific datatype is not valid
  • Cause: Using KV quantization (fp8_kv, fp4_kv) with encoder models.
  • Fix: Use fp8 or no_quant instead.
Error: FP8 quantization is only supported on L4, H100, H200, B200
  • Cause: Using FP8 quantization on an unsupported GPU.
  • Fix: Use a supported GPU (L4, H100, H200, or B200), or use no_quant.
Error: FP4 quantization is only supported on B200
  • Cause: Using FP4 quantization on an unsupported GPU.
  • Fix: Use a B200 GPU, or switch to FP8 quantization.
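The three errors above can be anticipated with a local check before building. The sketch below is a hypothetical helper, not the builder's own validation: it returns the documented error text when a quantization type and accelerator combination would fail, and it treats fp4_mlp_only as FP4 because this reference lists both as B200 only.

```python
# GPU lists are taken from the error messages in this section.
FP8_GPUS = {"L4", "H100", "H200", "B200"}
FP4_GPUS = {"B200"}


def quantization_error(quantization_type: str, accelerator: str):
    """Return the expected error message, or None if the combination looks valid."""
    if quantization_type.endswith("_kv"):  # e.g. fp8_kv, fp4_kv
        return ("encoder does not have a kv-cache, "
                "therefore a kv specific datatype is not valid")
    if quantization_type == "fp8" and accelerator not in FP8_GPUS:
        return "FP8 quantization is only supported on L4, H100, H200, B200"
    if quantization_type in {"fp4", "fp4_mlp_only"} and accelerator not in FP4_GPUS:
        return "FP4 quantization is only supported on B200"
    return None
```

For instance, `quantization_error("fp8", "A10G")` returns the FP8 message, while `quantization_error("fp8", "L4")` returns None.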