Inference stack
BEI runs on the v1 inference stack. In config.yaml, set inference_stack: v1 and base_model: encoder for causal architectures (Llama, Mistral, Qwen, Gemma) or base_model: encoder_bert for BERT-family encoders. Configuration lives entirely in the Truss config.yaml; the llm_config Management API block applies only to v2. For MoE text generation on v2, see BIS-LLM.
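The selection rule above can be sketched in Python. This is an illustration only: the helper name bei_settings and the two family sets are hypothetical, and the result is a flat mapping of the two settings named in this section. Where these keys nest inside config.yaml is not shown here; the BEI configuration reference has the full schema.

```python
# Illustrative helper (hypothetical name): choose base_model from the model family.
CAUSAL_FAMILIES = {"llama", "mistral", "qwen", "gemma"}
BERT_FAMILIES = {"bert", "roberta", "jina", "nomic", "modernbert"}


def bei_settings(family: str) -> dict:
    """Return the v1 inference-stack settings for a model family."""
    key = family.strip().lower()
    if key in CAUSAL_FAMILIES:
        base_model = "encoder"
    elif key in BERT_FAMILIES:
        base_model = "encoder_bert"
    else:
        raise ValueError(f"unknown model family: {family!r}")
    # Flat mapping only; the nesting inside config.yaml is not shown here.
    return {"inference_stack": "v1", "base_model": base_model}


print(bei_settings("Qwen"))
print(bei_settings("ModernBERT"))
```

An unknown family raises ValueError instead of silently defaulting, since the two variants are built differently.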
Architectures
BEI runs causal embedding architectures (Llama, Mistral, Qwen, Gemma) with FP8 and FP4 quantization for maximum throughput. For bidirectional encoders like BERT, RoBERTa, Jina, Nomic, and ModernBERT, BEI ships a specialized variant called BEI-Bert. BEI-Bert runs at FP16 or BF16 and is optimized for cold-start-sensitive workloads and models under 4B parameters.
- BEI: Causal embeddings with FP8/FP4 quantization. Up to 1,400 embeddings per second on H100, 121K tokens/s on B200.
- BEI-Bert: Bidirectional BERT-family encoders at FP16 or BF16. Tuned for fast cold-start on models under 4B parameters.
Workflows
BEI handles three common workflows: embeddings (/v1/embeddings), reranking and classification (/rerank and /predict), and named entity recognition (/predict_tokens, BEI-Bert only). All three share the same trt_llm configuration block; the route and base_model change per workflow.
Embeddings
Causal embedders (Llama, Mistral, Qwen, Gemma) deploy on BEI with base_model: encoder and pull weights from Hugging Face by default.
BEI reads the pooling configuration from the model repository's modules.json and 1_Pooling/config.json. You do not set pooling in config.yaml. See Pooling layer support for the full matrix including SPLADE on BEI-Bert.
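To see which pooling a repository declares before deploying, a small local check can help. The sketch below assumes the sentence-transformers layout, in which modules.json lists a Pooling module whose folder contains a config.json with boolean pooling_mode_* flags. The function name declared_pooling is hypothetical, and how BEI maps these flags to its supported pooling layers is covered by the pooling matrix, not by this code.

```python
import json
from pathlib import Path


def declared_pooling(repo_dir: str) -> list:
    """List the pooling modes enabled in a sentence-transformers style repo."""
    root = Path(repo_dir)
    modules = json.loads((root / "modules.json").read_text())
    # Find the module entry whose type ends in ".Pooling".
    pooling = next(
        (m for m in modules if m.get("type", "").endswith(".Pooling")), None
    )
    if pooling is None:
        raise ValueError("modules.json declares no Pooling module")
    config = json.loads((root / pooling["path"] / "config.json").read_text())
    # Pooling flags are booleans named pooling_mode_*; keep the enabled ones.
    return sorted(
        key for key, value in config.items()
        if key.startswith("pooling_mode_") and value is True
    )
```

Run it against a local clone or snapshot of the model repository, for example declared_pooling("./my-embedding-model"), where the path is a placeholder.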
Reranking and classification
Reranking and classification models route to /rerank or /predict and use the same trt_llm block.
/rerank returns one index and score per input text, for example [{"index": 0, "score": 0.92}, {"index": 1, "score": 0.14}], ordered by the input texts. Sort by score descending to rerank. Some rerankers (such as michaelfeil/Qwen3-Reranker-8B-seq) expect chat-style prompt templates and need webserver_default_route: /predict instead; use the Performance Client so it applies the right template and autoscaling counts load correctly.
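Because the scores come back in input order, the client does the sorting. A minimal sketch of that step follows, using the response shape shown above; the function name rerank and the sample texts are made up for illustration.

```python
def rerank(texts, scores):
    """Pair each score with its input text and order by score, best first."""
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    # "index" points back into the original list of input texts.
    return [(texts[item["index"]], item["score"]) for item in ranked]


texts = ["Paris is the capital of France.", "Bananas are yellow."]
response = [{"index": 0, "score": 0.92}, {"index": 1, "score": 0.14}]
for text, score in rerank(texts, response):
    print(f"{score:.2f}  {text}")
```

Looking texts up through index, not through list position, keeps the pairing correct after the sort.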
For classification models, set base_model: encoder_bert and webserver_default_route: /predict. The classifier head needs an id2label dictionary in the Hugging Face config; the build fails with a clear error if it is missing.
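A quick local check can catch a missing id2label before a build is started. The sketch below only inspects the text of a Hugging Face config.json. The function name check_id2label is hypothetical, and the error it raises is this script's own, not the message the BEI build prints.

```python
import json


def check_id2label(config_text: str) -> dict:
    """Return id2label from a Hugging Face config.json string, or raise."""
    config = json.loads(config_text)
    id2label = config.get("id2label")
    # Treat a missing, empty, or non-dictionary value as an error.
    if not isinstance(id2label, dict) or not id2label:
        raise ValueError("config.json has no id2label dictionary")
    return id2label


sample = '{"id2label": {"0": "NEGATIVE", "1": "POSITIVE"}}'
print(check_id2label(sample))
```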
Named entity recognition
Token-level entity classification deploys on BEI-Bert only and routes to /predict_tokens. The full request/response format and a Python example are on the Named entity recognition page.
OpenAI compatibility
BEI deployments expose /v1/embeddings and work with the standard OpenAI client.
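A minimal sketch of such a call with the openai Python package (version 1 or later) is shown below. The deployment URL, the API key, and the model name "bei" are placeholders chosen for illustration, and the helper names are hypothetical; substitute the values for your own deployment. The OpenAI client appends /embeddings to its base URL, so the base URL has to end in /v1 to reach /v1/embeddings.

```python
def openai_base_url(deployment_root: str) -> str:
    """Build a base URL ending in /v1 so the client reaches /v1/embeddings."""
    return deployment_root.rstrip("/") + "/v1"


def embed(texts, deployment_root, api_key, model="bei"):
    """Embed a list of strings; "bei" is a placeholder model name."""
    # Imported lazily so openai_base_url works without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key=api_key, base_url=openai_base_url(deployment_root))
    response = client.embeddings.create(model=model, input=texts)
    # One embedding vector per input string, in input order.
    return [item.embedding for item in response.data]


# Example call (placeholder values, not run here):
# vectors = embed(["BEI serves embeddings."], "https://<your-deployment-url>", "<api-key>")
```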
Related
- BEI configuration reference: Full trt_llm schema, pooling matrix, hardware support, and throughput benchmarks.
- BEI-Bert: BERT-specific configuration, model recommendations, and cold-start guidance.
- Named entity recognition: /predict_tokens request and response format.
- Embedding examples: Concrete deployment examples.
- Performance Client: High-throughput batch inference for embeddings and reranking.