Baseten Embeddings Inference (BEI) is Baseten’s solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM. BEI delivers the lowest latency and highest throughput of any embedding inference solution.

BEI vs BEI-Bert

BEI comes in two variants, each optimized for different model architectures:

BEI features

Use BEI when:
  • Model uses a causal architecture (e.g., Llama, Mistral, or Qwen models adapted for embeddings)
  • You need quantization support (FP8, FP4)
  • Maximum throughput is required
  • Models like BAAI/bge, Qwen3-Embedding, Salesforce/SFR-Embedding
Benefits:
  • Quantization Support: FP8 and FP4 quantization for 2-4x speedup
  • Highest Throughput: Up to 1,400 client-side embeddings per second
  • XQA Kernels: Optimized attention kernels for maximum performance
  • Dynamic Batching: Automatic batch optimization for varying loads
Supported Architectures:
  • LlamaModel (Llama-based embedding models)
  • MistralModel (e.g., Salesforce/SFR-Embedding-Mistral)
  • Qwen2Model (e.g., Qwen/Qwen3-Embedding-8B)
  • Gemma2Model (e.g., BAAI/bge-multilingual-gemma2, Google/EmbeddingGemma)

BEI-Bert features

Use BEI-Bert when:
  • Model uses a BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or another bidirectional-attention model
  • You need cold-start optimization for small models (<4B parameters)
  • 16-bit precision is sufficient for your use case
  • Model architectures like Jina-BERT, Nomic, or ModernBERT
Benefits:
  • Cold-Start Optimization: Optimized for fast initialization and small models
  • 16-bit Precision: Models run in FP16/BF16 precision
  • BERT Architecture Support: Specialized optimization for bidirectional models
  • Low Memory Footprint: Efficient for smaller models and edge deployments
Supported Architectures:
  • BertModel (e.g., sentence-transformers/all-MiniLM-L6-v2)
  • RobertaModel (e.g., FacebookAI/roberta-base)
  • Jina-BERT (e.g., jinaai/jina-embeddings-v2-base-en)
  • Nomic-BERT (e.g., nomic-ai/nomic-embed-text-v1.5)

Model types and use cases

Embedding models

Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG). Examples:
  • BAAI/bge-large-en-v1.5: General-purpose English embeddings
  • michaelfeil/Qwen3-Embedding-8B-auto: Multilingual embeddings with quantization support
  • Salesforce/SFR-Embedding-Mistral: Instruction-tuned embeddings
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-large-en-v1.5"
    quantization_type: no_quant  # fp8/fp4 quantization also supported for causal models

Reranking models

Reranking models are actually classification models that score document relevance for search and retrieval tasks. They work by classifying query-document pairs as relevant or not relevant. How rerankers work:
  • Rerankers are sequence classification models (ending with ForSequenceClassification)
  • They take a query and document as input and output a relevance score
  • The “reranking” is accomplished by scoring multiple documents and ranking them by the classification score
  • You can implement reranking by using the classification endpoint with proper prompt templates
Recommended:
  • BAAI/bge-reranker-v2-m3: A strong, lightweight reranker (279M parameters). Performs well in RAG systems where a first-pass vector retrieval surfaces dozens of candidate snippets.
  • michaelfeil/Qwen3-Reranker-8B-seq: Best multilingual and general-purpose reranker. Note: must be used with the webserver_default_route: /predict setting.
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-reranker-v2-m3"
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /rerank
Implementation: Use the /predict endpoint with proper prompt formatting for query-document pairs. The baseten-performance-client handles reranking template formatting automatically.
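For illustration, the sketch below sends a rerank request over HTTP to a deployment configured with webserver_default_route: /rerank. The /sync/rerank URL follows the pattern of the other examples in this guide, and the query/texts payload fields are assumptions modeled on TEI-style rerank APIs; check your deployment's API reference for the exact schema.
import os
import requests

# Placeholder deployment URL; replace model-xxxxxx with your model ID.
base_url = "https://model-xxxxxx.api.baseten.co/environments/production/sync"

# Assumed TEI-style payload: one query scored against several candidate documents.
payload = {
    "query": "What does BEI optimize for?",
    "texts": [
        "BEI delivers low-latency, high-throughput embedding inference.",
        "A recipe for sourdough bread with a long fermentation.",
    ],
}

response = requests.post(
    f"{base_url}/rerank",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # Relevance scores; rank documents by descending score.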

Classification models

Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection. Examples:
  • papluca/xlm-roberta-base-language-detection: Language identification
  • samlowe/roberta-base-go_emotions: Emotion classification
  • Reward Models: RLHF reward model examples
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "papluca/xlm-roberta-base-language-detection"
    quantization_type: no_quant  # BEI-Bert runs in FP16/BF16 only
  runtime:
    webserver_default_route: /predict
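As a rough sketch, the request below calls the /predict route configured above. The inputs field and the label/score response shape are assumptions based on common TEI-style classification endpoints, not a guaranteed schema; consult your deployment's API reference.
import os
import requests

# Placeholder deployment URL; replace model-xxxxxx with your model ID.
base_url = "https://model-xxxxxx.api.baseten.co/environments/production/sync"

response = requests.post(
    f"{base_url}/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    # Assumed payload: text whose language should be detected.
    json={"inputs": "Baseten rend le déploiement de modèles simple."},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # Expected: label/score pairs, e.g., the detected language.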

Performance and optimization

Throughput benchmarks

For detailed performance benchmarks, see: Run Qwen3 Embedding on NVIDIA Blackwell GPUs
Framework   Precision   GPU    Max Token/s Throughput   Max Request/s Throughput
TEI         BF16        H100   34,055                   824.25
vLLM        BF16        H100   36,625                   155.23
BEI         BF16        H100   47,549                   761.44
BEI         FP8         H100   77,107                   855.96
BEI         FP8         B200   121,443                  1,310.52
  • Token Throughput/s: Measured with 500 tokens per request
  • Request Throughput/s: Measured with 5 tokens per request

Quantization impact

Quantization      Speed Improvement   Memory Reduction   Accuracy Impact
FP16/BF16 vLLM    Baseline            None                None
FP16/BF16 BEI     1.3x                None                None
FP8 BEI           2x faster           50%                 ~1%
FP4 BEI           3.5x faster         75%                 1-2%

Hardware requirements

GPU Type     BEI Support   BEI-Bert Support   Recommended For
L4           Full          Full               Cost-effective deployments
A10G, A100   Full          Full               Legacy support
T4           No            Supported          Legacy support
H100         Full          Full               Maximum performance
B200         Full          Full               FP4 quantization

OpenAI compatibility

BEI deployments are fully OpenAI compatible for embeddings:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

embedding = client.embeddings.create(
    input=["Baseten Embeddings are fast.", "Embed this sentence!"],
    model="not-required"
)
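Continuing the example above, the response follows the standard OpenAI embeddings schema, so each vector is available as a plain list of floats:
for item in embedding.data:
    print(len(item.embedding))  # dimensionality of each returned vector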

Baseten Performance Client

For maximum throughput, use the Baseten Performance Client.
import os

from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

texts = ["Hello world", "Example text", "Another sample"]
response = client.embed(
    input=texts,
    model="my_model",
    batch_size=4,
    max_concurrent_requests=32,
    timeout_s=360
)
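In this example, batch_size controls how many inputs are grouped into each request and max_concurrent_requests caps how many requests are in flight at once; the values shown are illustrative and should be tuned to your model's max_num_tokens and traffic pattern.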

Reference config

For complete configuration options, see the BEI reference config.

Key configuration options

trt_llm:
  build:
    base_model: encoder  # or encoder_bert for BEI-Bert
    checkpoint_repository:
      source: HF  # or GCS, S3, AZURE, REMOTE_URL
      repo: "model-repo-name"
      revision: main
      runtime_secret_name: hf_access_token
    max_num_tokens: 16384  # BEI automatically upgrades to 16384
    quantization_type: fp8  # or no_quant for BEI-Bert
  runtime:
    webserver_default_route: /v1/embeddings  # or /rerank, /predict

Production best practices

GPU selection guidelines

  • L4: Best for models <4B parameters, cost-effective
  • H100: Required for models with 4B+ parameters or long contexts (>8K tokens)
  • H100_40GB: Use for models with memory constraints

Build job optimization

# H100 builds (default)
trt_llm:
  build:
    num_builder_gpus: 2

# L4 builds (memory-constrained)
trt_llm:
  build:
    num_builder_gpus: 4

Model-specific recommendations

BERT-based models (BEI-Bert):
  • Use encoder_bert base model
  • No quantization support (FP16/BF16 only)
  • Best for models <200M parameters on L4
ModernBERT and newer architectures:
  • Support longer contexts (up to 8192 tokens)
  • Use H100 for models >1B parameters
  • Consider memory requirements for long sequences
Qwen embedding models:
  • Use regular FP8 quantization
  • Support very long contexts (up to 131K tokens)
  • Higher memory requirements for long sequences

Token limit optimization

trt_llm:
  build:
    max_num_tokens: 16384  # Default, automatically set by BEI
    # Override for specific use cases:
    # max_num_tokens: 8192   # Standard embeddings
    # max_num_tokens: 131072  # Qwen long-context models

Getting started

  1. Choose your variant: BEI for causal models and quantization, BEI-Bert for BERT models
  2. Review configuration: See BEI reference config
  3. Deploy your model: Use the configuration templates and examples
  4. Test integration: Use OpenAI client or Performance Client for maximum throughput

Examples and further reading