BEI-Bert is a specialized variant of Baseten Embeddings Inference (BEI) optimized for BERT-based model architectures. It offers faster cold starts than standard BEI and full 16-bit precision for models that benefit from bidirectional attention.

When to use BEI-Bert

Ideal use cases

Model architectures:
  • Sentence-transformers: sentence-transformers/all-MiniLM-L6-v2
  • Jina models: jinaai/jina-embeddings-v2-base-en, jinaai/jina-embeddings-v2-base-code
  • Nomic models: nomic-ai/nomic-embed-text-v1.5, nomic-ai/nomic-embed-code-v1.5
  • BERT variants: FacebookAI/roberta-base, cardiffnlp/twitter-roberta-base
  • Gemma3Bidirectional: google/embeddinggemma-300m
  • ModernBERT: answerdotai/ModernBERT-base
  • Qwen2Bidirectional: Alibaba-NLP/gte-Qwen2-7B-instruct
Deployment scenarios:
  • Cold-start sensitive applications: Where first-request latency is critical
  • Small to medium models (<4B parameters): Where quantization isn't needed
  • High-accuracy requirements: Where 16-bit precision is preferred
  • Bidirectional attention: Models with bidirectional attention run best on this engine

BEI-Bert vs BEI comparison

Feature       | BEI-Bert                             | BEI
Architecture  | BERT-based (bidirectional)           | Causal (unidirectional)
Precision     | FP16 (16-bit)                        | BF16/FP16/FP8/FP4 (quantized)
Cold-start    | Optimized for fast initialization    | Standard startup
Quantization  | Not supported                        | FP8/FP4 supported
Memory Usage  | Lower for small models               | Higher or equal
Throughput    | 600-900 embeddings/sec               | 800-1400 embeddings/sec
Best For      | Small BERT models, accuracy-critical | Large models, throughput-critical

Supported model families

Sentence-transformers

The most common BERT-based embedding models, optimized for semantic similarity. Popular models:
  • sentence-transformers/all-MiniLM-L6-v2 (384D, 22M params)
  • sentence-transformers/all-mpnet-base-v2 (768D, 110M params)
  • sentence-transformers/multi-qa-mpnet-base-dot-v1 (768D, 110M params)
Configuration:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "sentence-transformers/all-MiniLM-L6-v2"
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings

Jina AI embeddings

Jina’s BERT-based models optimized for various domains including code. Popular models:
  • jinaai/jina-embeddings-v2-base-en (512D, 137M params)
  • jinaai/jina-embeddings-v2-base-code (512D, 137M params)
  • jinaai/jina-embeddings-v2-base-es (512D, 137M params)
Configuration:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "jinaai/jina-embeddings-v2-base-en"
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings

Nomic AI embeddings

Nomic’s models with specialized training for text and code. Popular models:
  • nomic-ai/nomic-embed-text-v1.5 (768D, 137M params)
  • nomic-ai/nomic-embed-code-v1.5 (768D, 137M params)
Configuration:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "nomic-ai/nomic-embed-text-v1.5"
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings

RoBERTa and variants

Facebook AI’s RoBERTa and other BERT variants for specific domains. Popular models:
  • FacebookAI/roberta-base (768D, 125M params)
  • cardiffnlp/twitter-roberta-base (768D, 125M params)
Configuration:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "FacebookAI/roberta-base"
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings

Configuration examples

Basic sentence-transformer deployment

model_name: BEI-Bert-MiniLM
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "sentence-transformers/all-MiniLM-L6-v2"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
    plugin_configuration:
      paged_kv_cache: false
      use_paged_context_fmha: false
      use_fp8_context_fmha: false
  runtime:
    webserver_default_route: /v1/embeddings
    kv_cache_free_gpu_mem_fraction: 0.9
    batch_scheduler_policy: guaranteed_no_evict

Jina code embeddings deployment

model_name: BEI-Bert-Jina-Code
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "jinaai/jina-embeddings-v2-base-code"
      revision: main
    max_num_tokens: 8192
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
    kv_cache_free_gpu_mem_fraction: 0.9
    batch_scheduler_policy: guaranteed_no_evict

Nomic text embeddings with custom routing

model_name: BEI-Bert-Nomic-Text
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "nomic-ai/nomic-embed-text-v1.5"
      revision: main
    max_num_tokens: 16384
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
    kv_cache_free_gpu_mem_fraction: 0.85
    batch_scheduler_policy: guaranteed_no_evict

Integration examples

OpenAI client usage

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Basic embedding
response = client.embeddings.create(
    input="This is a test sentence for embedding.",
    model="not-required"
)

# Batch embedding
response = client.embeddings.create(
    input=[
        "First sentence to embed.",
        "Second sentence to embed.",
        "Third sentence to embed."
    ],
    model="not-required"
)

print(f"Embedding dimension: {len(response.data[0].embedding)}")
print(f"Number of embeddings: {len(response.data)}")

Baseten Performance Client

For maximum throughput with BEI-Bert:
import os

from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

# High-throughput batch processing
texts = [f"Sentence {i}" for i in range(1000)]
response = client.embed(
    input=texts,
    model="not-required",
    batch_size=8,
    max_concurrent_requests=16,
    timeout_s=300
)

print(f"Processed {len(response.numpy())} embeddings")
print(f"Embedding shape: {response.numpy().shape}")

Direct API usage

import requests
import os
import json

headers = {
    "Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}",
    "Content-Type": "application/json"
}

data = {
    "input": ["Text to embed", "Another text"],
    "encoding_format": "float"
}

response = requests.post(
    "https://model-xxxxxx.api.baseten.co/environments/production/sync/v1/embeddings",
    headers=headers,
    json=data
)

result = response.json()
print(f"Embeddings: {len(result['data'])} embeddings generated")

Best practices

Model selection

For general purpose:
  • Use sentence-transformers/all-MiniLM-L6-v2 for balance of speed and quality
  • Use sentence-transformers/all-mpnet-base-v2 for higher quality
For code embeddings:
  • Use jinaai/jina-embeddings-v2-base-code for general code
  • Use nomic-ai/nomic-embed-code-v1.5 for specialized code tasks
For multilingual:
  • Use sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Use jinaai/jina-embeddings-v2-base-es for Spanish

Hardware optimization

Cost-effective deployments:
  • L4 GPUs for models <200M parameters
  • H100 GPUs for models 200-500M parameters
  • Enable autoscaling for variable traffic
Performance optimization:
  • Use max_num_tokens: 8192 for most use cases
  • Use max_num_tokens: 16384 for long documents
  • Tune batch_scheduler_policy based on traffic patterns

Deployment strategies

For development:
  • Start with smaller models (MiniLM)
  • Use L4 GPUs for cost efficiency
  • Enable detailed logging
For production:
  • Use larger models (MPNet) for better quality
  • Use H100 GPUs for better performance
  • Implement monitoring and alerting
For edge deployments:
  • Use smallest suitable models
  • Optimize for cold-start performance
  • Consider model size constraints

Troubleshooting

Common issues

Slow cold-start times:
  • Ensure model is properly cached
  • Consider using smaller models
  • Check GPU memory availability
Lower than expected throughput:
  • Verify max_num_tokens is appropriate
  • Check batch_scheduler_policy settings
  • Monitor GPU utilization
Memory issues:
  • Reduce max_num_tokens if needed
  • Use smaller models for available memory
  • Monitor memory usage during deployment

Performance tuning

For lower latency:
  • Reduce max_num_tokens
  • Use batch_scheduler_policy: guaranteed_no_evict
  • Consider smaller models
For higher throughput:
  • Increase max_num_tokens appropriately
  • Use batch_scheduler_policy: max_utilization (see the config sketch at the end of this section)
  • Optimize batch sizes in client code
For cost optimization:
  • Use L4 GPUs when possible
  • Choose appropriately sized models
  • Implement efficient autoscaling
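
The throughput-oriented settings above map onto the same trt_llm keys used in the deployment examples earlier; the model and values below are illustrative starting points rather than tuned recommendations:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "sentence-transformers/all-mpnet-base-v2"
    max_num_tokens: 16384
    quantization_type: no_quant
  runtime:
    webserver_default_route: /v1/embeddings
    kv_cache_free_gpu_mem_fraction: 0.9
    batch_scheduler_policy: max_utilization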

Migration from other systems

From sentence-transformers library

Python code:
# Before (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# After (BEI-Bert)
from openai import OpenAI
client = OpenAI(api_key=BASETEN_API_KEY, base_url=BASE_URL)
response = client.embeddings.create(input=sentences, model="not-required")
embeddings = [item.embedding for item in response.data]

From other embedding services

BEI-Bert provides OpenAI-compatible endpoints, making migration straightforward:
  1. Update base URL: Point to Baseten deployment
  2. Update API key: Use Baseten API key
  3. Test compatibility: Verify embedding dimensions and quality (see the sketch below)
  4. Optimize: Tune batch sizes and concurrency for performance
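
For the compatibility check, a minimal sketch (assuming the MiniLM deployment shown earlier, and reusing the BASE_URL placeholder from the snippet above) compares a local sentence-transformers embedding with the deployed endpoint; the cosine similarity should be close to 1.0:
import os

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

sentence = "This is a test sentence for embedding."

# Local reference embedding.
local_vec = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(sentence)

# Embedding from the BEI-Bert deployment.
client = OpenAI(api_key=os.environ["BASETEN_API_KEY"], base_url=BASE_URL)
remote_vec = np.array(
    client.embeddings.create(input=sentence, model="not-required").data[0].embedding
)

cosine = np.dot(local_vec, remote_vec) / (np.linalg.norm(local_vec) * np.linalg.norm(remote_vec))
print(f"Dimensions: {len(local_vec)} vs {len(remote_vec)}, cosine similarity: {cosine:.4f}")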

Further reading