When to use BEI-Bert
Ideal use cases
Model architectures:
- Sentence-transformers: sentence-transformers/all-MiniLM-L6-v2
- Jina models: jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-base-code
- Nomic models: nomic-ai/nomic-embed-text-v1.5, nomic-ai/nomic-embed-code-v1.5
- BERT variants: FacebookAI/roberta-base, cardiffnlp/twitter-roberta-base
- Gemma3Bidirectional: google/embeddinggemma-300m
- ModernBERT: answerdotai/ModernBERT-base
- Qwen2Bidirectional: Alibaba-NLP/gte-Qwen2-7B-instruct
- Cold-start sensitive applications: Where first-request latency is critical
- Small to medium models (<4B parameters): Where quantization isn’t needed
- High-accuracy requirements: Where 16-bit precision is preferred
- Bidirectional attention: Models that use bidirectional attention run best on this engine
BEI-Bert vs BEI comparison
| Feature | BEI-Bert | BEI |
|---|---|---|
| Architecture | BERT-based (bidirectional) | Causal (unidirectional) |
| Precision | FP16 (16-bit) | BF16/FP16/FP8/FP4 (quantized) |
| Cold-start | Optimized for fast initialization | Standard startup |
| Quantization | Not supported | FP8/FP4 supported |
| Memory Usage | Lower for small models | Higher or equal |
| Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec |
| Best For | Small BERT models, accuracy-critical | Large models, throughput-critical |
Supported model families
Sentence-transformers
The most common BERT-based embedding models, optimized for semantic similarity. Popular models:
- sentence-transformers/all-MiniLM-L6-v2 (384D, 22M params)
- sentence-transformers/all-mpnet-base-v2 (768D, 110M params)
- sentence-transformers/multi-qa-mpnet-base-dot-v1 (768D, 110M params)
Jina AI embeddings
Jina’s BERT-based models optimized for various domains, including code. Popular models:
- jinaai/jina-embeddings-v2-base-en (512D, 137M params)
- jinaai/jina-embeddings-v2-base-code (512D, 137M params)
- jinaai/jina-embeddings-v2-base-es (512D, 137M params)
Nomic AI embeddings
Nomic’s models with specialized training for text and code. Popular models:
- nomic-ai/nomic-embed-text-v1.5 (768D, 137M params)
- nomic-ai/nomic-embed-code-v1.5 (768D, 137M params)
RoBERTa and variants
Facebook AI’s RoBERTa and other BERT variants for specific domains. Popular models:
- FacebookAI/roberta-base (768D, 125M params)
- cardiffnlp/twitter-roberta-base (768D, 125M params)
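A quick way to check whether a candidate checkpoint is a BERT-style bidirectional encoder (as opposed to a causal model better served by BEI) is to inspect its Hugging Face config locally. The sketch below uses the transformers library; the repo IDs are examples from the lists above.

```python
# Sketch: inspect a checkpoint's architecture before deploying on BEI-Bert.
# Assumes the `transformers` library is installed; repo IDs are examples.
from transformers import AutoConfig

candidates = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "FacebookAI/roberta-base",
]

for repo_id in candidates:
    cfg = AutoConfig.from_pretrained(repo_id)
    # BERT-style encoders report model types such as "bert" or "roberta",
    # while causal LMs report types such as "llama" or "qwen2".
    print(repo_id, "->", cfg.model_type, cfg.architectures)
```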
Configuration examples
Basic sentence-transformer deployment
Jina code embeddings deployment
Nomic text embeddings with custom routing
Integration examples
OpenAI client usage
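A minimal sketch, assuming the official `openai` Python SDK; the base URL and model name are placeholders to replace with your deployment’s values.

```python
# Sketch: OpenAI-compatible embeddings against a BEI-Bert deployment.
# The base_url below is a placeholder; use the URL shown for your deployment.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

response = client.embeddings.create(
    model="sentence-transformers/all-MiniLM-L6-v2",  # example model from above
    input=["BEI-Bert serves BERT-style embedding models.", "Second document."],
)

print(len(response.data), "embeddings,", len(response.data[0].embedding), "dimensions")
```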
Baseten Performance Client
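For maximum throughput with BEI-Bert, the Performance Client batches and parallelizes embedding requests on the client side. The sketch below is illustrative only: the package name, `PerformanceClient` class, `embed` method, and tuning parameters are assumptions about the client’s interface, and the URL and model name are placeholders; see the Performance client documentation under Further reading for the actual API.

```python
# Sketch only: the import path, class, method, and parameters below are
# assumptions about the Performance Client's interface; consult the
# Performance client documentation for the real signatures.
import os
from baseten_performance_client import PerformanceClient  # assumed import path

client = PerformanceClient(
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync",  # placeholder
    api_key=os.environ["BASETEN_API_KEY"],
)

texts = [f"document {i}" for i in range(10_000)]
response = client.embed(
    input=texts,
    model="sentence-transformers/all-MiniLM-L6-v2",  # example model
    batch_size=64,                 # assumed tuning knobs
    max_concurrent_requests=64,
)
print(len(response.data), "embeddings returned")
```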
Direct API usage
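A minimal sketch of calling the OpenAI-compatible /v1/embeddings route directly with `requests`; the URL and authorization header format are placeholders to adapt to your deployment.

```python
# Sketch: raw HTTP call to the OpenAI-compatible embeddings route.
# URL and auth header are placeholders; adjust for your deployment.
import os
import requests

url = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1/embeddings"  # placeholder
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

payload = {
    "model": "sentence-transformers/all-MiniLM-L6-v2",  # example model
    "input": ["How do I deploy an embedding model?"],
}

resp = requests.post(url, json=payload, headers=headers, timeout=60)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding), "dimensions")
```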
Best practices
Model selection
For general purpose:
- Use sentence-transformers/all-MiniLM-L6-v2 for a balance of speed and quality
- Use sentence-transformers/all-mpnet-base-v2 for higher quality (a local comparison sketch follows these lists)

For code:
- Use jinaai/jina-embeddings-v2-base-code for general code
- Use nomic-ai/nomic-embed-code-v1.5 for specialized code tasks

For multilingual text:
- Use sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Use jinaai/jina-embeddings-v2-base-es for Spanish
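Before committing to a model, it can help to compare candidates locally on a few representative pairs. A minimal sketch with the sentence-transformers library; the model IDs come from the list above and the sentence pairs are illustrative.

```python
# Sketch: compare candidate models locally on representative sentence pairs.
from sentence_transformers import SentenceTransformer, util

pairs = [
    ("How do I reset my password?", "Steps to recover account access"),
    ("How do I reset my password?", "Quarterly revenue grew 8%"),
]

for model_id in [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
]:
    model = SentenceTransformer(model_id)
    for a, b in pairs:
        emb = model.encode([a, b], normalize_embeddings=True)
        score = util.cos_sim(emb[0], emb[1]).item()
        print(f"{model_id}: {score:.3f}  {a!r} vs {b!r}")
```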
Hardware optimization
Cost-effective deployments:
- L4 GPUs for models <200M parameters
- H100 GPUs for models 200-500M parameters (a rough sizing sketch follows this list)
- Enable autoscaling for variable traffic
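These GPU cutoffs roughly track FP16 weight size at 2 bytes per parameter, with headroom for activations and batching; a back-of-the-envelope sketch:

```python
# Sketch: FP16 weights take ~2 bytes per parameter; activations and batch
# buffers need additional headroom on top of this.
def fp16_weight_gb(params_millions: float) -> float:
    return params_millions * 1e6 * 2 / 1e9

for name, params in [
    ("all-MiniLM-L6-v2", 22),
    ("all-mpnet-base-v2", 110),
    ("roberta-base", 125),
]:
    print(f"{name}: ~{fp16_weight_gb(params):.2f} GB of FP16 weights")
```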
Configuration tuning:
- Use max_num_tokens: 8192 for most use cases
- Use max_num_tokens: 16384 for long documents
- Tune batch_scheduler_policy based on traffic patterns
Deployment strategies
For development:
- Start with smaller models (MiniLM)
- Use L4 GPUs for cost efficiency
- Enable detailed logging

For production:
- Use larger models (MPNet) for better quality
- Use H100 GPUs for better performance
- Implement monitoring and alerting

For cold-start sensitive applications:
- Use the smallest suitable models
- Optimize for cold-start performance
- Consider model size constraints
Troubleshooting
Common issues
Slow cold-start times:
- Ensure the model is properly cached
- Consider using smaller models
- Check GPU memory availability
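To confirm whether cold starts (rather than steady-state latency) are the issue, time the first request after a scale-up against warm requests. A minimal sketch, reusing the placeholder endpoint format from the integration examples:

```python
# Sketch: compare cold (first) vs. warm request latency against the deployment.
import os
import time
import requests

url = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1/embeddings"  # placeholder
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
payload = {"model": "sentence-transformers/all-MiniLM-L6-v2", "input": ["warm-up"]}

for label in ["cold", "warm", "warm"]:
    start = time.perf_counter()
    requests.post(url, json=payload, headers=headers, timeout=300).raise_for_status()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
```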
Low throughput:
- Verify max_num_tokens is appropriate
- Check batch_scheduler_policy settings
- Monitor GPU utilization
Out-of-memory errors:
- Reduce max_num_tokens if needed
- Use smaller models for available memory
- Monitor memory usage during deployment
Performance tuning
For lower latency:
- Reduce max_num_tokens
- Use batch_scheduler_policy: guaranteed_no_evict
- Consider smaller models
For higher throughput:
- Increase max_num_tokens appropriately
- Use batch_scheduler_policy: max_utilization
- Optimize batch sizes in client code (a batching sketch follows this list)
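Client-side batching keeps the server’s batcher saturated without sending oversized single requests. A minimal sketch, assuming the `openai` SDK and the same placeholder deployment URL and model as in the integration examples; tune the batch size and worker count for your deployment.

```python
# Sketch: chunk a large corpus into modest batches and embed them concurrently.
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)

documents = [f"document {i}" for i in range(5_000)]
BATCH_SIZE = 64    # tune per model size and typical token length
MAX_WORKERS = 8    # tune to match the deployment's concurrency

def embed_batch(batch):
    resp = client.embeddings.create(
        model="sentence-transformers/all-MiniLM-L6-v2",  # example model
        input=batch,
    )
    return [item.embedding for item in resp.data]

batches = [documents[i:i + BATCH_SIZE] for i in range(0, len(documents), BATCH_SIZE)]
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    embeddings = [vec for result in pool.map(embed_batch, batches) for vec in result]

print(len(embeddings), "embeddings")
```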
For cost efficiency:
- Use L4 GPUs when possible
- Choose appropriately sized models
- Implement efficient autoscaling
Migration from other systems
From sentence-transformers library
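A minimal before/after sketch in Python, assuming the `sentence-transformers` library for the existing code and the `openai` SDK pointed at a placeholder BEI-Bert deployment URL.

```python
# Before: local inference with the sentence-transformers library.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
local_embeddings = model.encode(["Hello world"], normalize_embeddings=True)

# After: the same model served by BEI-Bert via the OpenAI-compatible endpoint.
# The base_url is a placeholder; use your deployment's URL.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
)
remote = client.embeddings.create(
    model="sentence-transformers/all-MiniLM-L6-v2",
    input=["Hello world"],
)
print(len(local_embeddings[0]), len(remote.data[0].embedding))  # dimensions should match
```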
From other embedding services
BEI-Bert provides OpenAI-compatible endpoints, making migration straightforward:
- Update base URL: Point to Baseten deployment
- Update API key: Use Baseten API key
- Test compatibility: Verify embedding dimensions and quality
- Optimize: Tune batch sizes and concurrency for performance
Further reading
- BEI overview - General BEI documentation
- BEI reference config - Complete configuration options
- Embedding examples - Concrete deployment examples
- Performance client documentation - Client usage with embeddings
- Performance optimization - General performance guidance