BEI vs BEI-Bert
BEI comes in two variants, each optimized for different model architectures:
- BEI: Causal embedding models with quantization support and maximum throughput.
- BEI-Bert: BERT-based models with cold-start optimization, 16-bit precision, and bidirectional attention.
BEI features
Use BEI when:
- The model uses a causal architecture (Llama, Mistral, Qwen for embeddings)
- You need quantization support (FP8, FP4)
- Maximum throughput is required
- You are deploying models like BAAI/bge, Qwen3-Embedding, or Salesforce/SFR-Embedding
Key features:
- Quantization Support: FP8 and FP4 quantization for a 2-4x speedup
- Highest Throughput: Up to 1400 client embeddings per second
- XQA Kernels: Optimized attention kernels for maximum performance
- Dynamic Batching: Automatic batch optimization for varying loads
Supported architectures:
- LlamaModel (e.g., BAAI/bge-multilingual-gemma2)
- MistralModel (e.g., Salesforce/SFR-Embedding-Mistral)
- Qwen2Model (e.g., Qwen/Qwen3-Embedding-8B)
- Gemma2Model (e.g., Google/EmbeddingGemma)
BEI-Bert features
Use BEI-Bert when:
- The model uses a BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or a generic bidirectional attention model
- You need cold-start optimization for small models (<4B parameters)
- 16-bit precision is sufficient for your use case
- You are deploying architectures like Jina-BERT, Nomic, or ModernBERT
Key features:
- Cold-Start Optimization: Optimized for fast initialization and small models
- 16-bit Precision: Models run in FP16 precision
- BERT Architecture Support: Specialized optimization for bidirectional models
- Low Memory Footprint: Efficient for smaller models and edge deployments
Supported architectures:
- BertModel (e.g., sentence-transformers/all-MiniLM-L6-v2)
- RobertaModel (e.g., FacebookAI/roberta-base)
- Jina-BERT (e.g., jinaai/jina-embeddings-v2-base-en)
- Nomic-BERT (e.g., nomic-ai/nomic-embed-text-v1.5)
- Alibaba-GTE (e.g., Alibaba-NLP/gte-large-en-v1.5)
- Llama Bidirectional (e.g., nvidia/llama-embed-nemotron-8b)
Model types and use cases
Embedding models
Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG). Examples:
- BAAI/bge-large-en-v1.5: General-purpose English embeddings
- michaelfeil/Qwen3-Embedding-8B-auto: Multilingual embeddings with quantization support
- Salesforce/SFR-Embedding-Mistral: Instruction-tuned embeddings
Reranking models
Reranking models are classification models that score document relevance for search and retrieval tasks. They work by classifying query-document pairs as relevant or not relevant.

How rerankers work:
- Rerankers are sequence classification models (ending with ForSequenceClassification)
- They take a query and a document as input and output a relevance score
- The “reranking” is accomplished by scoring multiple documents and ranking them by the classification score
- You can implement reranking by using the classification endpoint with proper prompt templates
Recommended models:
- BAAI/bge-reranker-v2-m3: Strong, compact reranking model (279M params). Performs well in RAG systems where a first pass of vector retrieval surfaces dozens of candidate snippets.
- michaelfeil/Qwen3-Reranker-8B-seq: Best multilingual and general-purpose reranker. Note: must be used with the webserver_default_route: /predict setting.
Rerankers are served through the /predict endpoint with proper prompt formatting for query-document pairs. The baseten-performance-client handles reranking template formatting automatically, as sketched below.
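A minimal reranking sketch with the performance client. The deployment URL and API key are placeholders, and the response field names (index, score) are assumptions based on common rerank APIs; check the client documentation for your version:

```python
from baseten_performance_client import PerformanceClient

# Placeholder deployment URL and API key; substitute your own values.
client = PerformanceClient(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync",
    api_key="YOUR_API_KEY",
)

# The client formats each query-document pair with the model's
# reranking prompt template before calling the classification endpoint.
response = client.rerank(
    query="What is the capital of France?",
    texts=[
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
)

# Assumed response shape: one (index, score) entry per input document.
for item in response.data:
    print(item.index, item.score)
```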
Classification models
Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection. Examples:
- papluca/xlm-roberta-base-language-detection: Language identification
- samlowe/roberta-base-go_emotions: Emotion classification
- Reward Models: RLHF reward model examples
Named entity recognition (NER): BEI-Bert only
NER models classify each token in the input text into entity categories such as person (PER), organization (ORG), location (LOC), and miscellaneous (MISC). NER models use the ForTokenClassification architecture and the /predict_tokens endpoint. NER requires BEI-Bert (base_model: encoder_bert) and is not supported on BEI.
Recommended:
- dslim/bert-base-NER-uncased: Fast, compact NER model for English. (truss example)
- tanaos/tanaos-NER-v1: General-purpose NER model.
Request fields for the /predict_tokens endpoint:

| Field | Type | Description |
|---|---|---|
| inputs | list of list of strings | Batched text inputs to classify. Each inner list is a batch of texts. |
| raw_scores | boolean | When true, returns raw logit scores for all labels per token. When false, returns only the top predicted label with its probability. |
| truncate | boolean | Truncates inputs that exceed the model’s max sequence length. |
| truncation_direction | string | Controls which end is truncated. Defaults to "Right". |
| aggregation_strategy | string | Merges sub-word tokens into entity spans. Accepts "none", "simple", "first", "average", or "max". Use "max" to match the behavior of transformers.pipeline("ner", aggregation_strategy="max"). Omit or use "none" for token-level predictions. |
aggregation_strategy: "max" (recommended for production):
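For example, a direct HTTP call (the URL shape, API key, and example sentence are illustrative placeholders; adjust to your deployment):

```python
import requests

# Placeholder deployment URL and API key.
url = "https://model-xxxxxxx.api.baseten.co/environments/production/sync/predict_tokens"
headers = {"Authorization": "Api-Key YOUR_API_KEY"}

payload = {
    "inputs": [["Satya Nadella is the CEO of Microsoft in Redmond."]],
    "aggregation_strategy": "max",  # merge sub-word tokens into entity spans
    "truncate": True,
}
response = requests.post(url, json=payload, headers=headers)

# Expected shape (illustrative): entity spans with a label such as
# PER/ORG/LOC, a confidence score, and character offsets into the input.
print(response.json())
```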
aggregation_strategy: "none" and raw_scores: true (token-level with BIO labels):
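The same call with token-level output, reusing the url and headers from the previous example; only the payload changes:

```python
payload = {
    "inputs": [["Angela Merkel visited Paris."]],
    "aggregation_strategy": "none",  # one prediction per token
    "raw_scores": True,              # return logits for every label per token
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())  # per-token BIO labels, e.g. B-PER, I-PER, O
```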
B- marks the beginning of an entity, I- marks a continuation, and O means outside any entity.
Python example (using Baseten Performance Client):
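A sketch using the client's general-purpose batch_post method to reach the custom /predict_tokens route (the URL, key, and exact parameter names are assumptions; consult the client docs for your version):

```python
from baseten_performance_client import PerformanceClient

# Placeholder deployment URL and API key.
client = PerformanceClient(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync",
    api_key="YOUR_API_KEY",
)

# batch_post fans out arbitrary JSON payloads to a custom route
# with client-side concurrency control.
responses = client.batch_post(
    url_path="/predict_tokens",
    payloads=[
        {
            "inputs": [["Satya Nadella is the CEO of Microsoft in Redmond."]],
            "aggregation_strategy": "max",
        }
    ],
)
print(responses[0])
```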
If you are not using the performance client, you can call the /predict_tokens route directly, as in the first example above. The /predict_tokens route also supports async inference.
Performance and optimization
Throughput benchmarks
For detailed performance benchmarks, see Run Qwen3 Embedding on NVIDIA Blackwell GPUs.

| Framework | Precision | GPU | Max Token/s Throughput | Max Request/s Throughput |
|---|---|---|---|---|
| TEI | FP16 | H100 | 34,055 | 824.25 |
| BEI-Bert | FP16 | H100 | 36,520 | 841.05 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |
- Token Throughput/s: Measured with 500 tokens per request
- Request Throughput/s: Measured with 5 tokens per request
Quantization impact
| Quantization | Speed Improvement | Memory Reduction | Accuracy Impact |
|---|---|---|---|
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x faster | None | None |
| FP8 BEI | 2x faster | 50% | ~1% |
| FP4 BEI | 3.5x faster | 75% | 1-2% |
Hardware requirements
| GPU Type | BEI Support | BEI-Bert Support | Recommended For |
|---|---|---|---|
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Full | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |
OpenAI compatibility
BEI deployments are fully OpenAI compatible for embeddings:
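For example, with the OpenAI Python SDK (the base URL shape is a placeholder; use your deployment's OpenAI-compatible endpoint):

```python
from openai import OpenAI

# Placeholder base URL and API key; point at your BEI deployment's
# OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync/v1",
    api_key="YOUR_BASETEN_API_KEY",
)

response = client.embeddings.create(
    model="my_model",  # model name as configured in your deployment
    input=["BEI serves OpenAI-compatible embeddings."],
)
print(len(response.data[0].embedding))  # embedding dimensionality
```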
Baseten Performance Client
For maximum throughput, use the Baseten Performance Client.
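A minimal embedding sketch (the URL and key are placeholders, and the batching parameter names reflect the client's batching knobs; check the client docs for exact signatures):

```python
from baseten_performance_client import PerformanceClient

# Placeholder deployment URL and API key.
client = PerformanceClient(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync",
    api_key="YOUR_API_KEY",
)

# The client micro-batches and parallelizes requests for high throughput.
response = client.embed(
    input=["first document", "second document"],
    model="my_model",
    batch_size=16,               # texts per request
    max_concurrent_requests=32,  # client-side parallelism
)
print(len(response.data))
```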
Reference config
For complete configuration options, see the BEI reference config.
Key configuration options
Production best practices
GPU selection guidelines
- L4: Best for models <4B parameters; cost-effective
- H100: Required for models with 4B+ parameters or long context (>8K tokens)
- H100_40GB: Use for models with memory constraints
Build job optimization
Model-specific recommendations
BERT-based models (BEI-Bert):
- Use the encoder_bert base model
- No quantization support (FP16/BF16 only)
- Best for models <200M parameters on L4
- Support longer contexts (up to 8192 tokens)
- Use H100 for models >1B parameters
- Consider memory requirements for long sequences

Causal models (BEI):
- Use regular FP8 quantization
- Support very long contexts (up to 131K tokens)
- Higher memory requirements for long sequences
Token limit optimization
Getting started
- Choose your variant: BEI for causal models and quantization, BEI-Bert for BERT models
- Review configuration: See BEI reference config
- Deploy your model: Use the configuration templates and examples
- Test integration: Use OpenAI client or Performance Client for maximum throughput
Examples and further reading
- BEI-Bert examples - BERT-specific configurations
- BEI reference config - Complete configuration options
- Embedding examples - Concrete deployment examples
- Performance client documentation - Client Usage with Embeddings