FP16 or BF16, optimizes cold-start latency, and supports bidirectional attention for sub-4B-parameter encoders.
Bidirectional attention means each token in the input can attend to every other token, in both directions. BERT-family encoders use this pattern, which generally produces better embeddings because each token sees the full context. Causal models like GPT use the opposite pattern: each token attends only to itself and earlier tokens, never to later ones. Some Qwen and Llama checkpoints (the `*Bidirectional` model variants listed below) are causal LLMs adapted to run in bidirectional mode specifically for embedding use.
BEI vs BEI-Bert
Both variants run on the same engine binary. Pick the variant that matches your base architecture.

| Feature | BEI-Bert | BEI |
|---|---|---|
| Architecture | BERT-based (bidirectional) | Causal (unidirectional) |
| Precision | FP16 (16-bit) | BF16, FP16, FP8, FP4 |
| Cold-start | Optimized for fast initialization | Standard startup |
| Quantization | Not supported | FP8, FP4 supported |
| Memory usage | Lower for small models | Higher or equal |
| Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec |
| Best for | Small BERT models, accuracy-critical | Large models, throughput-critical |
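
A quick back-of-the-envelope calculation puts the throughput row in perspective. The sketch below assumes the ranges in the table hold for your model, sequence length, and GPU, which they may not. `batch_seconds` is a hypothetical helper written for this illustration, not part of any Baseten API.

```python
# Throughput ranges copied from the table above (embeddings per second).
# Real numbers depend on model, sequence length, and hardware, so treat
# these as illustrative bounds only.
THROUGHPUT = {
    "BEI-Bert": (600, 900),
    "BEI": (800, 1400),
}


def batch_seconds(num_embeddings: int, variant: str) -> tuple:
    """Return (best_case, worst_case) wall-clock seconds for a batch job."""
    low, high = THROUGHPUT[variant]
    return num_embeddings / high, num_embeddings / low


for variant in THROUGHPUT:
    best, worst = batch_seconds(1_000_000, variant)
    print(f"{variant}: {best / 60:.1f} to {worst / 60:.1f} minutes per 1M embeddings")
```

Under these assumptions, one million embeddings take roughly 18.5 to 27.8 minutes on BEI-Bert and 11.9 to 20.8 minutes on BEI. The ranges overlap, so benchmark with your own model before choosing on throughput alone.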
When to use BEI-Bert
Choose BEI-Bert when any of these apply:
- BERT-family base architecture: `BertModel`, `RobertaModel`, `ModernBertModel`, `XLMRobertaModel`, or a `*Bidirectional` adapted checkpoint.
- Cold-start matters: first-request latency is critical for your traffic shape.
- Small to medium models: under 4B parameters where FP8/FP4 quantization isn’t needed.
- 16-bit precision: workloads where FP16 accuracy is preferred over quantized throughput.
- Token-level classification: NER and other `/predict_tokens` endpoints run on BEI-Bert only.
For FP8/FP4 quantization, use BEI. See the BEI overview.
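
The selection criteria can be read as a small decision function. This is a minimal sketch of one possible reading, not an official tool: `Workload` and `choose_variant` are hypothetical names, and only the hard constraints stated on this page (architecture, quantization, `/predict_tokens`) are encoded. Softer factors such as cold-start latency are left out.

```python
from dataclasses import dataclass

# BERT-family architectures named on this page as BEI-Bert targets.
BERT_FAMILY = {"BertModel", "RobertaModel", "ModernBertModel", "XLMRobertaModel"}


@dataclass
class Workload:
    architecture: str                         # e.g. "BertModel", "Qwen2Bidirectional"
    needs_quantization: bool = False          # FP8/FP4 required?
    needs_token_classification: bool = False  # uses /predict_tokens?


def choose_variant(w: Workload) -> str:
    """Hypothetical helper: pick a variant from the hard constraints only."""
    bidirectional = (
        w.architecture in BERT_FAMILY or w.architecture.endswith("Bidirectional")
    )
    if bidirectional:
        if w.needs_quantization:
            raise ValueError("BEI-Bert does not support FP8/FP4 quantization")
        return "BEI-Bert"
    if w.needs_token_classification:
        raise ValueError("/predict_tokens runs on BEI-Bert only")
    return "BEI"


print(choose_variant(Workload("XLMRobertaModel")))
print(choose_variant(Workload("LlamaForCausalLM", needs_quantization=True)))
```

The two calls print `BEI-Bert` and `BEI`. Combinations the page rules out, such as a BERT-family model that needs FP8, raise an error rather than guessing.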
Supported model families
BEI-Bert runs the following base architectures: `BertModel`, `RobertaModel`, `ModernBertModel`, `XLMRobertaModel`, `Gemma3Bidirectional`, `Qwen2Bidirectional`, `Qwen3Bidirectional`, `LLama3Bidirectional`.
Sentence-transformers
The most common BERT-based embedding models, optimized for semantic similarity.
- `sentence-transformers/all-MiniLM-L6-v2` (384D, 22M params)
- `sentence-transformers/all-mpnet-base-v2` (768D, 110M params)
- `sentence-transformers/multi-qa-mpnet-base-dot-v1` (768D, 110M params)
Jina AI
Jina’s BERT-based models for general and code-specific domains.
- `jinaai/jina-embeddings-v2-base-en` (768D, 137M params)
- `jinaai/jina-embeddings-v2-base-code` (768D, 161M params)
- `jinaai/jina-embeddings-v2-base-es` (768D, 161M params)
Nomic AI
Nomic’s models with specialized training for text and code.
- `nomic-ai/nomic-embed-text-v1.5` (768D, 137M params)
- `nomic-ai/nomic-embed-code-v1.5` (768D, 137M params)
Alibaba GTE and Qwen (bidirectional)
Multilingual models with instruction-tuning and long-context support.
- `Alibaba-NLP/gte-Qwen2-7B-instruct` (top-ranked multilingual)
- `Alibaba-NLP/gte-Qwen2-1.5B-instruct` (cost-effective alternative)
- `intfloat/multilingual-e5-large-instruct`
Bidirectional LLM variants
Some Qwen and Llama checkpoints run in bidirectional mode: each token attends to the full input, which often improves embedding quality over causal pooling.
- Qwen2Bidirectional: `Alibaba-NLP/gte-Qwen2-7B-instruct`
- Qwen3Bidirectional: `voyageai/voyage-4-nano` (contact Baseten for deploy config)
- Llama3Bidirectional: `nvidia/llama-embed-nemotron-8b`
Deploy these checkpoints with `base_model: encoder_bert`. The build applies bidirectional attention automatically.
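
To make the difference between causal and bidirectional attention concrete, the sketch below builds both mask shapes for a four-token input. It is purely illustrative: `attention_mask` is a hypothetical helper, and nothing here reflects how the BEI-Bert build applies the change internally.

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list:
    """Build a seq_len x seq_len mask; mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = attention_mask(4, bidirectional=False)
full = attention_mask(4, bidirectional=True)

for name, mask in (("causal", causal), ("bidirectional", full)):
    print(name)
    for row in mask:
        print(row)
```

The causal mask is lower-triangular, so the first token sees only itself, while every row of the bidirectional mask is all ones. With a causal mask, early tokens carry no information about what follows, which is why pooling over bidirectional outputs often yields better embeddings.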
Checkpoint requirements
BEI-Bert builds standard Hugging Face checkpoints only. Repos that require `trust_remote_code` fail at build time. Pin `checkpoint_repository.revision` when the model maintainer publishes a compatible config on a non-default branch.
For `voyageai/voyage-4-nano`, the default Hugging Face branch is not compatible with BEI-Bert. Contact your Baseten representative for the current `checkpoint_repository` settings before you deploy.
Reranking
BEI-Bert runs cross-encoder rerankers via `/rerank`. Recommended:
- `BAAI/bge-reranker-large` (XLM-RoBERTa)
- `BAAI/bge-reranker-base` (XLM-RoBERTa base)
- `Alibaba-NLP/gte-multilingual-reranker-base`
- `Alibaba-NLP/gte-reranker-modernbert-base`
Classification
BEI-Bert runs sequence classifiers via `/predict`. The classifier head needs an `id2label` dictionary in the Hugging Face config. Recommended:
- `SamLowe/roberta-base-go_emotions` (sentiment)
- `papluca/xlm-roberta-base-language-detection` (language ID)
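
Because the classifier depends on `id2label`, it can be worth checking a checkpoint's config locally before building. The sketch below assumes you have already loaded the repo's `config.json` into a Python dict. `check_id2label` and `label_for` are hypothetical helpers, the sample config is invented for the example, and how BEI-Bert itself validates the field is not described on this page.

```python
def check_id2label(config: dict) -> dict:
    """Return id2label with integer keys, or raise if it is missing or empty."""
    raw = config.get("id2label")
    if not isinstance(raw, dict) or not raw:
        raise ValueError("config has no id2label mapping; /predict needs one")
    # config.json stores the keys as strings ("0", "1", ...); normalize to int.
    return {int(k): str(v) for k, v in raw.items()}


def label_for(logits: list, id2label: dict) -> str:
    """Map the highest-scoring class index to its label."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return id2label[best]


# Invented two-class config, for illustration only.
sample_config = {"id2label": {"0": "negative", "1": "positive"}}
labels = check_id2label(sample_config)
print(label_for([0.1, 2.3], labels))
```

The example prints `positive`. A config without the mapping raises immediately, which is cheaper to discover locally than after a build.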
Named entity recognition
Token-level entity classification routes to `/predict_tokens` and runs on BEI-Bert only. Recommended:
- `dslim/bert-base-NER-uncased` (Truss example)
- `tanaos/tanaos-NER-v1`
Model selection by constraint
Choose based on your primary constraint:
Balanced cost and performance:
- `Alibaba-NLP/gte-Qwen2-7B-instruct`: instruction-tuned, ranked #1 for multilingual.
- `Alibaba-NLP/gte-Qwen2-1.5B-instruct`: 1/5 the size, still top-tier.
- `Snowflake/snowflake-arctic-embed-m-v2.0`: multilingual-optimized, MRL support.
Smaller, lower-latency models:
- `google/embeddinggemma-300m`: 300M params, 100+ languages.
- `nomic-ai/nomic-embed-text-v1.5`: 137M, minimal latency.
- `sentence-transformers/all-MiniLM-L6-v2`: 22M, legacy standard.
- Code: `jinaai/jina-embeddings-v2-base-code`.
- Long sequences: `Alibaba-NLP/gte-large-en-v1.5`.
- Reranking: `BAAI/bge-reranker-large`, `Alibaba-NLP/gte-reranker-modernbert-base`.
Minimal configuration
BEI-Bert deployments set `base_model: encoder_bert` and `quantization_type: no_quant`. Weights are pulled from Hugging Face by default.
For `max_num_tokens`, GPU support, and complete examples for sentence-transformers, Jina, Nomic, and bidirectional LLM variants, see the BEI configuration reference.
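
As a rough picture of how these settings fit together, the sketch below assembles them into a nested dict. Only `base_model`, `quantization_type`, `checkpoint_repository.revision`, and the top-level `trt_llm` name come from this page. The `build` nesting and the `source` and `repo` keys are assumptions, and `bei_bert_config` is a hypothetical helper, so check the result against the BEI configuration reference before using it.

```python
import json
from typing import Optional


def bei_bert_config(repo: str, revision: Optional[str] = None) -> dict:
    """Build a dict mirroring the config keys named on this page.

    The trt_llm -> build nesting and the source/repo keys are assumptions.
    """
    checkpoint = {"source": "HF", "repo": repo}  # assumed key names
    if revision is not None:
        # Pin a branch, tag, or commit when the default branch is not compatible.
        checkpoint["revision"] = revision
    return {
        "trt_llm": {
            "build": {
                "base_model": "encoder_bert",
                "quantization_type": "no_quant",
                "checkpoint_repository": checkpoint,
            }
        }
    }


config = bei_bert_config("sentence-transformers/all-MiniLM-L6-v2")
print(json.dumps(config, indent=2))
```

Passing `revision="some-branch"` adds the pin described under Checkpoint requirements. Leaving it out falls back to the repository's default branch.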
Related
- BEI overview: Causal embeddings, reranking, and OpenAI-compatible inference.
- BEI configuration reference: Full `trt_llm` schema, pooling matrix, hardware support, and complete configuration examples.
- Named entity recognition: `/predict_tokens` request and response format.
- Embedding examples: Concrete deployment examples.
- Performance Client: High-throughput batch inference for embeddings and reranking.