When to use BEI-Bert
Ideal use cases
Model architectures:
- Sentence-transformers: sentence-transformers/all-MiniLM-L6-v2
- Jina models: jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-base-code
- Nomic models: nomic-ai/nomic-embed-text-v1.5, nomic-ai/nomic-embed-code-v1.5
- BERT variants: FacebookAI/roberta-base, cardiffnlp/twitter-roberta-base
- Gemma3Bidirectional: google/embeddinggemma-300m
- ModernBERT: answerdotai/ModernBERT-base
- Qwen2Bidirectional: Alibaba-NLP/gte-Qwen2-7B-instruct
- Qwen3Bidirectional: voyageai/voyage-4-nano
- Llama3Bidirectional: nvidia/llama-embed-nemotron-8b
- Cold-start sensitive applications: Where first-request latency is critical
- Small to medium models: Under 4B parameters, where quantization isn’t needed
- High-accuracy requirements: Where 16-bit precision is preferred
- Bidirectional attention: Models with bidirectional attention run best on this engine.
BEI-Bert vs BEI comparison
| Feature | BEI-Bert | BEI |
|---|---|---|
| Architecture | BERT-based (bidirectional) | Causal (unidirectional) |
| Precision | FP16 (16-bit) | BF16/FP16/FP8/FP4 (quantized) |
| Cold-start | Optimized for fast initialization | Standard startup |
| Quantization | Not supported | FP8/FP4 supported |
| Memory usage | Lower for small models | Higher or equal |
| Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec |
| Best for | Small BERT models, accuracy-critical | Large models, throughput-critical |
Recommended models (MTEB ranking)
Top-tier embeddings
High performance (rank 2-8):
- Alibaba-NLP/gte-Qwen2-7B-instruct (7.61B): Bidirectional.
- intfloat/multilingual-e5-large-instruct (560M): Multilingual.
- google/embeddinggemma-300m (308M): Google’s compact model.
- Alibaba-NLP/gte-Qwen2-1.5B-instruct (1.78B): Cost-effective.
- Salesforce/SFR-Embedding-2_R (7.11B): Salesforce model.
- Snowflake/snowflake-arctic-embed-l-v2.0 (568M): Snowflake large.
- Snowflake/snowflake-arctic-embed-m-v2.0 (305M): Snowflake medium.
- WhereIsAI/UAE-Large-V1 (335M): UAE large model.
- nomic-ai/nomic-embed-text-v1 (137M): Nomic original.
- nomic-ai/nomic-embed-text-v1.5 (137M): Nomic improved.
- sentence-transformers/all-mpnet-base-v2 (109M): MPNet base.
- nomic-ai/nomic-embed-text-v2-moe (475M-A305M): Mixture of experts.
- Alibaba-NLP/gte-large-en-v1.5 (434M): Alibaba large English.
- answerdotai/ModernBERT-large (396M): Modern BERT large.
- jinaai/jina-embeddings-v2-base-en (137M): Jina English.
- jinaai/jina-embeddings-v2-base-code (137M): Jina code.
Re-ranking models
Top re-rankers (see the request sketch after this list):
- BAAI/bge-reranker-large: XLM-RoBERTa based.
- BAAI/bge-reranker-base: XLM-RoBERTa base.
- Alibaba-NLP/gte-multilingual-reranker-base: GTE multilingual.
- Alibaba-NLP/gte-reranker-modernbert-base: ModernBERT reranker.
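A deployed re-ranker is queried with a query plus a list of candidate texts and returns a relevance score per text. A minimal sketch, assuming a TEI-style /rerank route and a Baseten-style API key header; the base URL and response shape are placeholders, so check your deployment for the exact schema:

```python
# Sketch: scoring query/document pairs against a deployed re-ranker
# such as BAAI/bge-reranker-large. The /rerank path, payload, and
# response shape assume a TEI-style rerank route; the base URL is a
# placeholder.
import os

import requests

BASE_URL = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync"  # placeholder
API_KEY = os.environ["BASETEN_API_KEY"]

response = requests.post(
    f"{BASE_URL}/rerank",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "query": "How do I deploy an embedding model?",
        "texts": [
            "BEI-Bert serves BERT-style embedding and re-ranking models.",
            "Bananas are a good source of potassium.",
        ],
    },
    timeout=30,
)
response.raise_for_status()
# Assumed response shape: a list of {"index": ..., "score": ...} entries.
for hit in sorted(response.json(), key=lambda h: h["score"], reverse=True):
    print(hit["index"], hit["score"])
```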
Classification models
Sentiment analysis:
- SamLowe/roberta-base-go_emotions: RoBERTa for emotions.
Supported model families
Popular Hugging Face models
Find supported models on Hugging Face:
Sentence-transformers
The most common BERT-based embedding models, optimized for semantic similarity. Popular models:
- sentence-transformers/all-MiniLM-L6-v2 (384D, 22M params)
- sentence-transformers/all-mpnet-base-v2 (768D, 110M params)
- sentence-transformers/multi-qa-mpnet-base-dot-v1 (768D, 110M params)
Voyage and Nemotron bidirectional LLMs
Large decoder architectures with bidirectional attention, such as Qwen3 (voyageai/voyage-4-nano) or Llama3 (nvidia/llama-embed-nemotron-8b), can be deployed with BEI-Bert.
Configuration:
Jina AI embeddings
Jina’s BERT-based models optimized for various domains including code. Popular models:
- jinaai/jina-embeddings-v2-base-en (512D, 137M params)
- jinaai/jina-embeddings-v2-base-code (512D, 137M params)
- jinaai/jina-embeddings-v2-base-es (512D, 137M params)
Nomic AI embeddings
Nomic’s models with specialized training for text and code. Popular models:
- nomic-ai/nomic-embed-text-v1.5 (768D, 137M params)
- nomic-ai/nomic-embed-code-v1.5 (768D, 137M params)
Alibaba GTE and Qwen models
Advanced multilingual models with instruction-tuning and long-context support. Popular models:
- Alibaba-NLP/gte-Qwen2-7B-instruct: Top-ranked multilingual.
- Alibaba-NLP/gte-Qwen2-1.5B-instruct: Cost-effective alternative.
- intfloat/multilingual-e5-large-instruct: E5 multilingual variant.
Configuration examples
Cost-effective GTE-Qwen deployment
Basic sentence-transformer deployment
Jina code embeddings deployment
Nomic text embeddings with custom routing
Integration examples
OpenAI client with Qwen3 instructions
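A minimal sketch, assuming the deployment exposes an OpenAI-compatible /v1/embeddings route; the base_url is a placeholder, and the instruction prefix follows the convention published with Qwen-style instruct embedding models (check the model card for the exact wording):

```python
# Sketch: calling a BEI-Bert deployment's OpenAI-compatible embeddings
# route with the OpenAI Python SDK. The base_url is a placeholder and
# the "Instruct: ... Query: ..." prefix is the Qwen-style instruct
# embedding convention.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key=os.environ["BASETEN_API_KEY"],
)

task = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of France?"

response = client.embeddings.create(
    model="Alibaba-NLP/gte-Qwen2-7B-instruct",  # swap in the model you deployed
    input=[f"Instruct: {task}\nQuery: {query}"],  # queries get the instruction; documents are embedded as-is
)
print(len(response.data[0].embedding))
```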
Baseten Performance Client
For maximum throughput with BEI-Bert, use the Baseten Performance Client.
Direct API usage
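A minimal sketch of posting to the OpenAI-compatible embeddings route directly with requests; the URL is a placeholder, and the model name should match whatever the deployment is serving:

```python
# Sketch: calling /v1/embeddings without the OpenAI SDK. The URL is a
# placeholder for your deployment's endpoint.
import os

import requests

response = requests.post(
    "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1/embeddings",  # placeholder
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "input": ["first document", "second document"],
    },
    timeout=30,
)
response.raise_for_status()
embeddings = [row["embedding"] for row in response.json()["data"]]
print(len(embeddings), len(embeddings[0]))
```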
Best practices
Model selection guide
Choose based on your primary constraint:
Cost-effective (balanced performance/cost):
- Alibaba-NLP/gte-Qwen2-7B-instruct: Instruction-tuned, ranked #1 for multilingual.
- Alibaba-NLP/gte-Qwen2-1.5B-instruct: 1/5 the size, still top-tier.
- Snowflake/snowflake-arctic-embed-m-v2.0: Multilingual-optimized, MRL support.
Low-latency (small models):
- google/embeddinggemma-300m: 300M params, 100+ languages.
- Snowflake/snowflake-arctic-embed-m-v2.0: 305M, compression-friendly.
- nomic-ai/nomic-embed-text-v1.5: 137M, minimal latency.
- sentence-transformers/all-MiniLM-L6-v2: 22M, legacy standard.
- Code: jinaai/jina-embeddings-v2-base-code
- Long sequences: Alibaba-NLP/gte-large-en-v1.5
- Re-ranking: BAAI/bge-reranker-large, Alibaba-NLP/gte-reranker-modernbert-base
Hardware optimization
Cost-effective deployments:
- L4 GPUs for models under 200M parameters
- H100 GPUs for models with 200-500M parameters
- Enable autoscaling for variable traffic
- Use max_num_tokens: 8192 for most use cases
- Use max_num_tokens: 16384 for long documents
- Tune batch_scheduler_policy based on traffic patterns
Deployment strategies
For development:
- Start with smaller models (MiniLM)
- Use L4 GPUs for cost efficiency
- Enable detailed logging
For production:
- Use larger models (MPNet) for better quality
- Use H100 GPUs for better performance
- Implement monitoring and alerting
For cold-start-sensitive applications:
- Use smallest suitable models
- Optimize for cold-start performance
- Consider model size constraints
Troubleshooting
Common issues
Slow cold-start times:
- Ensure model is properly cached
- Consider using smaller models
- Check GPU memory availability
Low throughput:
- Verify max_num_tokens is appropriate
- Check batch_scheduler_policy settings
- Monitor GPU utilization
Out-of-memory errors:
- Reduce max_num_tokens if needed
- Use smaller models for available memory
- Monitor memory usage during deployment
Performance tuning
For lower latency:
- Reduce max_num_tokens
- Use batch_scheduler_policy: guaranteed_no_evict
- Consider smaller models
For higher throughput:
- Increase max_num_tokens appropriately
- Use batch_scheduler_policy: max_utilization
- Optimize batch sizes in client code
For lower cost:
- Use L4 GPUs when possible
- Choose appropriately sized models
- Implement efficient autoscaling
Migration from other systems
From sentence-transformers library
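A minimal before/after sketch, assuming the same model is served behind a BEI-Bert deployment's OpenAI-compatible route (the base_url below is a placeholder):

```python
# Before: embedding locally with the sentence-transformers library.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
local_embeddings = model.encode(["first document", "second document"])

# After: the same texts embedded remotely through a BEI-Bert deployment's
# OpenAI-compatible route. The base_url is a placeholder.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key=os.environ["BASETEN_API_KEY"],
)
remote = client.embeddings.create(
    model="sentence-transformers/all-MiniLM-L6-v2",
    input=["first document", "second document"],
)
remote_embeddings = [row.embedding for row in remote.data]
```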
From other embedding services
BEI-Bert provides OpenAI-compatible endpoints (see the sketch after this list):
- Update base URL: Point to Baseten deployment
- Update API key: Use Baseten API key
- Test compatibility: Verify embedding dimensions and quality
- Optimize: Tune batch sizes and concurrency for performance
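A minimal sketch of steps 1-3, assuming an OpenAI-compatible deployment of all-mpnet-base-v2; the base_url is a placeholder, and 768 is that model's documented embedding dimension:

```python
# Sketch of steps 1-3: point an existing OpenAI-style client at the
# Baseten deployment and confirm the embedding dimension matches what
# downstream indexes expect. The base_url is a placeholder.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # step 1: placeholder base URL
    api_key=os.environ["BASETEN_API_KEY"],  # step 2: Baseten API key
)
vector = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input=["compatibility check"],
).data[0].embedding
assert len(vector) == 768, f"unexpected embedding dimension: {len(vector)}"  # step 3
```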
Further reading
- BEI overview - General BEI documentation
- BEI reference config - Complete configuration options
- Embedding examples - Concrete deployment examples
- Performance client documentation - Client Usage with Embeddings
- Performance optimization - General performance guidance