BEI vs BEI-Bert
BEI comes in two variants, each optimized for different model architectures:
- BEI: Causal embedding models with quantization support and maximum throughput.
- BEI-Bert: BERT-based models with cold-start optimization, 16-bit precision, and bidirectional attention.
BEI features
Use BEI when:
- Model uses a causal architecture (Llama, Mistral, Qwen for embeddings)
- You need quantization support (FP8, FP4)
- Maximum throughput is required
- Models like BAAI/bge, Qwen3-Embedding, Salesforce/SFR-Embedding
Key features:
- Quantization Support: FP8 and FP4 quantization for 2-4x speedup
- Highest Throughput: Up to 1400 client embeddings per second
- XQA Kernels: Optimized attention kernels for maximum performance
- Dynamic Batching: Automatic batch optimization for varying loads
Supported architectures:
- LlamaModel (e.g., BAAI/bge-multilingual-gemma2)
- MistralModel (e.g., Salesforce/SFR-Embedding-Mistral)
- Qwen2Model (e.g., Qwen/Qwen3-Embedding-8B)
- Gemma2Model (e.g., Google/EmbeddingGemma)
BEI-Bert features
Use BEI-Bert when:
- Model uses a BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or a generic bidirectional attention model
- You need cold-start optimization for small models (<4B parameters)
- 16-bit precision is sufficient for your use case
- Model architectures like Jina-BERT, Nomic, or ModernBERT
Key features:
- Cold-Start Optimization: Optimized for fast initialization and small models
- 16-bit Precision: Models run in FP16/BF16 precision
- BERT Architecture Support: Specialized optimization for bidirectional models
- Low Memory Footprint: Efficient for smaller models and edge deployments
Supported architectures:
- BertModel (e.g., sentence-transformers/all-MiniLM-L6-v2)
- RobertaModel (e.g., FacebookAI/roberta-base)
- Jina-BERT (e.g., jinaai/jina-embeddings-v2-base-en)
- Nomic-BERT (e.g., nomic-ai/nomic-embed-text-v1.5)
Model types and use cases
Embedding models
Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG). Examples:
- BAAI/bge-large-en-v1.5: General-purpose English embeddings
- michaelfeil/Qwen3-Embedding-8B-auto: Multilingual embeddings with quantization support
- Salesforce/SFR-Embedding-Mistral: Instruction-tuned embeddings
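To make the semantic-search use case concrete, the sketch below ranks documents against a query by cosine similarity of their embedding vectors. It assumes the vectors were already produced by a deployed embedding model; the numbers shown are placeholders, not real model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these come from a deployed embedding model
# such as BAAI/bge-large-en-v1.5.
query_embedding = np.array([0.12, -0.45, 0.33, 0.08])
document_embeddings = {
    "doc_a": np.array([0.10, -0.40, 0.30, 0.10]),
    "doc_b": np.array([-0.50, 0.22, -0.11, 0.47]),
}

# Rank documents by similarity to the query (higher = more relevant).
ranked = sorted(
    document_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
for doc_id, embedding in ranked:
    print(doc_id, round(cosine_similarity(query_embedding, embedding), 3))
```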
Reranking models
Reranking models are actually classification models that score document relevance for search and retrieval tasks. They work by classifying query-document pairs as relevant or not relevant. How rerankers work:
- Rerankers are sequence classification models (ending with ForSequenceClassification)
- They take a query and a document as input and output a relevance score
- The “reranking” is accomplished by scoring multiple documents and ranking them by the classification score
- You can implement reranking by using the classification endpoint with proper prompt templates
Examples:
- BAAI/bge-reranker-v2-m3: Great reranking model (279M params). Performs well in RAG systems where a first pass of vector retrieval surfaces dozens of candidate snippets.
- michaelfeil/Qwen3-Reranker-8B-seq: Best multilingual, general-purpose reranker. Note: must be used with the webserver_default_route: /predict setting.
Use the /predict endpoint with proper prompt formatting for query-document pairs. The baseten-performance-client handles reranking template formatting automatically.
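A minimal sketch of that flow, assuming a deployed reranker: each query-document pair is formatted with a prompt template and scored via the deployment's /predict classification route, then documents are sorted by score. The template string, request payload, and response shape below are illustrative assumptions, not the verified API; in practice the baseten-performance-client applies the correct template for you.

```python
import requests

BASE_URL = "https://model-xxxxxxx.api.baseten.co/environments/production/sync"  # placeholder deployment URL
API_KEY = "YOUR_BASETEN_API_KEY"

# Illustrative template; the exact prompt format depends on the reranker model.
TEMPLATE = "Query: {query}\nDocument: {document}\nRelevant:"

def rerank(query: str, documents: list[str]) -> list[tuple[str, float]]:
    """Score each query-document pair via the classification route and sort by score."""
    prompts = [TEMPLATE.format(query=query, document=doc) for doc in documents]
    response = requests.post(
        f"{BASE_URL}/predict",
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"inputs": prompts},  # assumed payload shape
    )
    response.raise_for_status()
    # Assumed response shape: one list of {label, score} entries per input,
    # with the first entry carrying the relevance score.
    scores = [item[0]["score"] for item in response.json()]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

print(rerank("What is BEI?", ["BEI is an embeddings inference engine.", "Bananas are yellow."]))
```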
Classification models
Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection. Examples:
- papluca/xlm-roberta-base-language-detection: Language identification
- samlowe/roberta-base-go_emotions: Emotion classification
- Reward Models: RLHF reward model examples
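As with rerankers, classification requests go to the deployment's classification route. The snippet below is a rough sketch of a language-detection call; the URL, payload, and response shape are assumptions rather than the verified API.

```python
import requests

BASE_URL = "https://model-xxxxxxx.api.baseten.co/environments/production/sync"  # placeholder deployment URL
API_KEY = "YOUR_BASETEN_API_KEY"

# Assumed payload shape for the classification endpoint; the deployed model here
# would be something like papluca/xlm-roberta-base-language-detection.
response = requests.post(
    f"{BASE_URL}/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={"inputs": "Bonjour tout le monde"},
)
response.raise_for_status()
print(response.json())  # assumed to be label/score pairs, e.g. {"label": "fr", "score": 0.99}
```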
Performance and optimization
Throughput benchmarks
For detailed performance benchmarks, see: Run Qwen3 Embedding on NVIDIA Blackwell GPUs

| Framework | Precision | GPU | Max Token/s Throughput | Max Request/s Throughput |
|---|---|---|---|---|
| TEI | BF16 | H100 | 34,055 | 824.25 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |
- Token Throughput/s: Measured with 500 tokens per request
- Request Throughput/s: Measured with 5 tokens per request
Quantization impact
| Quantization | Speed Improvement | Memory Reduction | Accuracy Impact |
|---|---|---|---|
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x faster | None | None |
| FP8 BEI | 2x faster | 50% | ~1% |
| FP4 BEI | 3.5x faster | 75% | 1-2% |
Hardware requirements
| GPU Type | BEI Support | BEI-Bert Support | Recommended For |
|---|---|---|---|
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Supported | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |
OpenAI compatibility
BEI deployments are fully OpenAI compatible for embeddings:
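For example, the standard OpenAI Python SDK can be pointed at a BEI deployment; only the base URL and API key change. The deployment URL and model name below are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the OpenAI SDK at your BEI deployment (placeholder URL).
client = OpenAI(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync/v1",
    api_key="YOUR_BASETEN_API_KEY",
)

response = client.embeddings.create(
    model="my_model",  # passed through to the deployment; exact value depends on your setup
    input=["Baseten supports OpenAI-compatible embeddings.", "BEI maximizes throughput."],
)

print(len(response.data), len(response.data[0].embedding))
```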
Baseten Performance Client
For maximum throughput, use the Baseten Performance Client.
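A rough sketch with the baseten-performance-client is below; the import path and method signature are assumptions from memory and may differ between client versions, so check the client documentation before use.

```python
# pip install baseten-performance-client
from baseten_performance_client import PerformanceClient  # assumed import path

# Placeholder deployment URL and API key.
client = PerformanceClient(
    base_url="https://model-xxxxxxx.api.baseten.co/environments/production/sync",
    api_key="YOUR_BASETEN_API_KEY",
)

# Assumed method signature: the client batches and parallelizes requests for throughput.
response = client.embed(
    input=["first passage", "second passage"],
    model="my_model",
)
print(response)
```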
Reference config
For complete configuration options, see the BEI reference config.
Key configuration options
Production best practices
GPU selection guidelines
- L4: Best for models <4B parameters, cost-effective
- H100: Required for models 4B+ parameters or long context (>8K tokens)
- H100_40GB: Use for models with memory constraints
Build job optimization
Model-specific recommendations
BERT-based models (BEI-Bert):
- Use the encoder_bert base model
- No quantization support (FP16/BF16 only)
- Best for models <200M parameters on L4
- Support longer contexts (up to 8192 tokens)
- Use H100 for models >1B parameters
- Consider memory requirements for long sequences
Causal models (BEI):
- Use regular FP8 quantization
- Support very long contexts (up to 131K tokens)
- Higher memory requirements for long sequences
Token limit optimization
Getting started
- Choose your variant: BEI for causal models and quantization, BEI-Bert for BERT models
- Review configuration: See BEI reference config
- Deploy your model: Use the configuration templates and examples
- Test integration: Use OpenAI client or Performance Client for maximum throughput
Examples and further reading
- BEI-Bert examples - BERT-specific configurations
- BEI reference config - Complete configuration options
- Embedding examples - Concrete deployment examples
- Performance client documentation - Client Usage with Embeddings