BEI-Bert is a variant of Baseten Embeddings Inference for BERT-family architectures. It runs at FP16 or BF16, optimizes cold-start latency, and supports bidirectional attention for sub-4B-parameter encoders.
Bidirectional attention means each token in the input can attend to every other token, in both directions. BERT-family encoders use this pattern, which generally produces better embeddings because each token sees the full context. Causal models like GPT use the opposite pattern: each token attends only to earlier tokens, never to later ones. Some Qwen and Llama checkpoints (the *Bidirectional model variants listed below) are causal LLMs adapted to run in bidirectional mode specifically for embedding use.
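The two attention patterns can be pictured as a mask over token positions. The snippet below is a minimal illustrative sketch in plain Python, not BEI's implementation; the function name `attention_mask` is made up for this example.

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list[list[int]]:
    """Return a seq_len x seq_len mask; mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Causal: row i has ones only up to and including position i.
causal = attention_mask(4, bidirectional=False)
# Bidirectional: every position is visible from every other position.
full = attention_mask(4, bidirectional=True)

for name, mask in (("causal", causal), ("bidirectional", full)):
    print(name)
    for row in mask:
        print(" ", row)
```

In the causal mask the last token is the only one that sees the whole input, while in the bidirectional mask every token does.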

BEI vs BEI-Bert

Both variants run on the same engine binary. Pick the variant that matches your base architecture.
| Feature | BEI-Bert | BEI |
| --- | --- | --- |
| Architecture | BERT-based (bidirectional) | Causal (unidirectional) |
| Precision | FP16 (16-bit) | BF16, FP16, FP8, FP4 |
| Cold-start | Optimized for fast initialization | Standard startup |
| Quantization | Not supported | FP8, FP4 supported |
| Memory usage | Lower for small models | Higher or equal |
| Throughput | 600-900 embeddings/sec | 800-1400 embeddings/sec |
| Best for | Small BERT models, accuracy-critical | Large models, throughput-critical |

When to use BEI-Bert

Choose BEI-Bert when any of these apply:
  • BERT-family base architecture: BertModel, RobertaModel, ModernBertModel, XLMRobertaModel, or a *Bidirectional adapted checkpoint.
  • Cold-start matters: first-request latency is critical for your traffic shape.
  • Small to medium models: under 4B parameters where FP8/FP4 quantization isn’t needed.
  • 16-bit precision: workloads where FP16 accuracy is preferred over quantized throughput.
  • Token-level classification: NER and other /predict_tokens endpoints run on BEI-Bert only.
For models over 4B parameters, causal embedders, or workloads that need FP8/FP4 quantization, use BEI. See the BEI overview.
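As a rough illustration, the architecture and quantization criteria above can be written as a small decision helper. This is a sketch of one possible reading of the guidance, not an official selection rule: `pick_variant` and its parameters are hypothetical, and it deliberately leaves out the 4B-parameter guideline and cold-start considerations, which depend on your workload.

```python
# Architectures named on this page as BERT-family.
BERT_FAMILY = {"BertModel", "RobertaModel", "ModernBertModel", "XLMRobertaModel"}


def pick_variant(
    architecture: str,
    needs_quantization: bool = False,
    needs_token_classification: bool = False,
) -> str:
    """Hypothetical helper: map the criteria on this page to 'BEI-Bert' or 'BEI'."""
    supported = architecture in BERT_FAMILY or architecture.endswith("Bidirectional")
    if needs_token_classification:
        # /predict_tokens is described as BEI-Bert only.
        return "BEI-Bert"
    if needs_quantization or not supported:
        # FP8/FP4 workloads and causal embedders are routed to BEI.
        return "BEI"
    return "BEI-Bert"


print(pick_variant("XLMRobertaModel"))
print(pick_variant("LlamaForCausalLM"))
```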

Supported model families

BEI-Bert runs the following base architectures: BertModel, RobertaModel, ModernBertModel, XLMRobertaModel, Gemma3Bidirectional, Qwen2Bidirectional, Qwen3Bidirectional, Llama3Bidirectional.

Sentence-transformers

The most common BERT-based embedding models, optimized for semantic similarity.
  • sentence-transformers/all-MiniLM-L6-v2 (384D, 22M params)
  • sentence-transformers/all-mpnet-base-v2 (768D, 110M params)
  • sentence-transformers/multi-qa-mpnet-base-dot-v1 (768D, 110M params)

Jina AI

Jina’s BERT-based models for general and code-specific domains.
  • jinaai/jina-embeddings-v2-base-en (768D, 137M params)
  • jinaai/jina-embeddings-v2-base-code (768D, 137M params)
  • jinaai/jina-embeddings-v2-base-es (768D, 137M params)

Nomic AI

Nomic’s models with specialized training for text and code.
  • nomic-ai/nomic-embed-text-v1.5 (768D, 137M params)
  • nomic-ai/nomic-embed-code-v1.5 (768D, 137M params)

Alibaba GTE and Qwen (bidirectional)

Multilingual models with instruction-tuning and long-context support.
  • Alibaba-NLP/gte-Qwen2-7B-instruct (top-ranked multilingual)
  • Alibaba-NLP/gte-Qwen2-1.5B-instruct (cost-effective alternative)
  • intfloat/multilingual-e5-large-instruct

Bidirectional LLM variants

Some Qwen and Llama checkpoints run in bidirectional mode: each token attends to the full input, which often improves embedding quality over causal pooling.
  • Qwen2Bidirectional: Alibaba-NLP/gte-Qwen2-7B-instruct
  • Qwen3Bidirectional: voyageai/voyage-4-nano (contact Baseten for deploy config)
  • Llama3Bidirectional: nvidia/llama-embed-nemotron-8b
Set base_model: encoder_bert. The build applies bidirectional attention automatically.

Checkpoint requirements

BEI-Bert builds standard Hugging Face checkpoints only. Repos that require trust_remote_code fail at build time. Pin checkpoint_repository.revision when the model maintainer publishes a compatible config on a non-default branch. For voyageai/voyage-4-nano, the default Hugging Face branch is not compatible with BEI-Bert. Contact your Baseten representative for the current checkpoint_repository settings before you deploy.
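Before you build, it can help to check whether a checkpoint's config.json points at custom modeling code. The sketch below relies on a heuristic, not on anything BEI-Bert exposes: Hugging Face configs that require trust_remote_code typically carry an `auto_map` entry. The helper name `needs_remote_code` and both sample configs are made up for illustration.

```python
import json


def needs_remote_code(config: dict) -> bool:
    """Heuristic: configs that map Auto classes to custom modules carry an `auto_map` entry."""
    return bool(config.get("auto_map"))


# Sample config.json contents (illustrative, not taken from real repos).
standard = json.loads('{"architectures": ["BertModel"], "model_type": "bert"}')
custom = json.loads(
    '{"architectures": ["CustomModel"], '
    '"auto_map": {"AutoModel": "modeling_custom.CustomModel"}}'
)

print(needs_remote_code(standard))
print(needs_remote_code(custom))
```

A checkpoint that passes this check can still fail for other reasons, so treat it as an early warning only.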

Reranking

BEI-Bert runs cross-encoder rerankers via /rerank. Recommended:
  • BAAI/bge-reranker-large (XLM-RoBERTa)
  • BAAI/bge-reranker-base (XLM-RoBERTa base)
  • Alibaba-NLP/gte-multilingual-reranker-base
  • Alibaba-NLP/gte-reranker-modernbert-base
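Conceptually, a cross-encoder reranker scores each (query, text) pair and sorts the candidates by that score. The sketch below shows only that ordering step in plain Python; it does not call the /rerank endpoint, and `overlap_score` is a toy stand-in for a real model, introduced purely for illustration.

```python
from typing import Callable


def rerank(
    query: str, texts: list[str], score: Callable[[str, str], float]
) -> list[tuple[int, float]]:
    """Score every (query, text) pair and return (index, score) pairs, best first."""
    scored = [(i, score(query, text)) for i, text in enumerate(texts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


docs = ["how to bake bread", "deploying embedding models", "embedding models for search"]
ranking = rerank("embedding models search", docs, overlap_score)
for index, value in ranking:
    print(index, round(value, 2), docs[index])
```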

Classification

BEI-Bert runs sequence classifiers via /predict. The classifier head needs an id2label dictionary in the Hugging Face config. Recommended:
  • SamLowe/roberta-base-go_emotions (emotion classification)
  • papluca/xlm-roberta-base-language-detection (language ID)
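To confirm that a checkpoint meets the id2label requirement, you can inspect its config.json before deploying. This is a minimal sketch that assumes you have already loaded the config as a dict; `classifier_labels` is a hypothetical helper and the sample config is invented for the example.

```python
def classifier_labels(config: dict) -> dict[int, str]:
    """Return id2label with integer keys, or raise if the config has no usable mapping."""
    id2label = config.get("id2label")
    if not isinstance(id2label, dict) or not id2label:
        raise ValueError("config has no id2label dictionary")
    # JSON object keys are strings, so "0" becomes 0.
    return {int(key): str(value) for key, value in id2label.items()}


# Illustrative config.json contents for a two-class classifier.
sample = {
    "architectures": ["RobertaForSequenceClassification"],
    "id2label": {"0": "negative", "1": "positive"},
}
labels = classifier_labels(sample)
print(labels)
```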

Named entity recognition

Token-level entity classification routes to /predict_tokens and runs on BEI-Bert only. Recommended:
  • dslim/bert-base-NER-uncased (Truss example)
  • tanaos/tanaos-NER-v1
For the full request/response format and Python example, see Named entity recognition.
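Token-level predictions usually need to be merged into entity spans before they are useful. The sketch below assumes the model emits BIO-style labels (such as B-PER, I-PER, and O) and that you have already extracted parallel lists of tokens and labels from the response; it does not reflect the actual /predict_tokens response format, and `group_entities` is a hypothetical helper.

```python
def group_entities(tokens: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities: list[tuple[str, str]] = []
    current: list[str] = []
    current_type = ""
    for token, label in zip(tokens, labels):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current and etype == current_type:
            current.append(token)  # continue the open entity
            continue
        if current:
            # Any other label closes the open entity.
            entities.append((" ".join(current), current_type))
            current, current_type = [], ""
        if prefix in ("B", "I"):
            # "B" opens a new entity; a stray "I" is treated the same way.
            current, current_type = [token], etype
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["ada", "lovelace", "visited", "london"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]
print(group_entities(tokens, labels))
```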

Model selection by constraint

Choose based on your primary constraint.
Balanced cost and performance:
  • Alibaba-NLP/gte-Qwen2-7B-instruct: instruction-tuned, ranked #1 for multilingual.
  • Alibaba-NLP/gte-Qwen2-1.5B-instruct: 1/5 the size, still top-tier.
  • Snowflake/snowflake-arctic-embed-m-v2.0: multilingual-optimized, MRL support.
Lightweight (under 500M params):
  • google/embeddinggemma-300m: 300M params, 100+ languages.
  • nomic-ai/nomic-embed-text-v1.5: 137M, minimal latency.
  • sentence-transformers/all-MiniLM-L6-v2: 22M, legacy standard.
Specialized:
  • Code: jinaai/jina-embeddings-v2-base-code.
  • Long sequences: Alibaba-NLP/gte-large-en-v1.5.
  • Reranking: BAAI/bge-reranker-large, Alibaba-NLP/gte-reranker-modernbert-base.

Minimal configuration

BEI-Bert deployments set base_model: encoder_bert and quantization_type: no_quant. Weights are pulled from Hugging Face by default.
trt_llm:
  inference_stack: v1
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "sentence-transformers/all-MiniLM-L6-v2"
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
For the full schema, including max_num_tokens, GPU support, and complete examples for sentence-transformers, Jina, Nomic, and bidirectional LLM variants, see the BEI configuration reference.
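If you generate configs programmatically, the minimal configuration can be mirrored as a Python dict. The helper below is a sketch: `bei_bert_config` is a hypothetical name, the optional `revision` key follows the checkpoint_repository.revision setting mentioned under Checkpoint requirements, and passing "/rerank" as the default route is an assumption to verify against the BEI configuration reference.

```python
from typing import Optional


def bei_bert_config(
    repo: str, route: str = "/v1/embeddings", revision: Optional[str] = None
) -> dict:
    """Build a dict mirroring the minimal trt_llm config for a BEI-Bert deployment."""
    checkpoint = {"source": "HF", "repo": repo}
    if revision is not None:
        # Pin a specific branch or commit of the checkpoint repository.
        checkpoint["revision"] = revision
    return {
        "trt_llm": {
            "inference_stack": "v1",
            "build": {
                "base_model": "encoder_bert",
                "checkpoint_repository": checkpoint,
                "quantization_type": "no_quant",
            },
            "runtime": {"webserver_default_route": route},
        }
    }


embedder = bei_bert_config("sentence-transformers/all-MiniLM-L6-v2")
# Assumption: "/rerank" is accepted as a default route value.
reranker = bei_bert_config("BAAI/bge-reranker-base", route="/rerank")
print(embedder["trt_llm"]["runtime"])
print(reranker["trt_llm"]["runtime"])
```

Serialize the resulting dict to YAML with the tool of your choice before placing it in your deployment config.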