Baseten Embeddings Inference (BEI) is Baseten’s solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM. BEI delivers the lowest latency and highest throughput of any embedding inference solution.

BEI vs BEI-Bert

BEI comes in two variants, each optimized for different model architectures:

BEI

Causal embedding models with quantization support and maximum throughput.

BEI-Bert

BERT-based models with cold-start optimization, 16-bit precision and bidirectional attention.

BEI features

Use BEI when:
  • Model uses causal architecture (Llama, Mistral, Qwen for embeddings)
  • You need quantization support (FP8, FP4)
  • Maximum throughput is required
  • Models like BAAI/bge, Qwen3-Embedding, Salesforce/SFR-Embedding
Benefits:
  • Quantization Support: FP8 and FP4 quantization for 2-4x speedup
  • Highest Throughput: Up to 1400 client embeddings per second
  • XQA Kernels: Optimized attention kernels for maximum performance
  • Dynamic Batching: Automatic batch optimization for varying loads
Supported Architectures:
  • LlamaModel (e.g., BAAI/bge-multilingual-gemma2)
  • MistralModel (e.g., Salesforce/SFR-Embedding-Mistral)
  • Qwen2Model (e.g., Qwen/Qwen3-Embedding-8B)
  • Gemma2Model (e.g., Google/EmbeddingGemma)

BEI-Bert features

Use BEI-Bert when:
  • Model uses BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or generic bidirectional attention models
  • You need cold-start optimization for small models (<4B parameters)
  • 16-bit precision is sufficient for your use case
  • Model architectures like Jina-BERT, Nomic, or ModernBERT
Benefits:
  • Cold-Start Optimization: Optimized for fast initialization and small models
  • 16-bit Precision: Models run in FP16 precision
  • BERT Architecture Support: Specialized optimization for bidirectional models
  • Low Memory Footprint: Efficient for smaller models and edge deployments
Supported Architectures:
  • BertModel (e.g., sentence-transformers/all-MiniLM-L6-v2)
  • RobertaModel (e.g., FacebookAI/roberta-base)
  • Jina-BERT (e.g., jinaai/jina-embeddings-v2-base-en)
  • Nomic-BERT (e.g., nomic-ai/nomic-embed-text-v1.5)
  • Alibaba-GTE (e.g., Alibaba-NLP/gte-large-en-v1.5)
  • Llama Bidirectional (e.g., nvidia/llama-embed-nemotron-8b)

Model types and use cases

Embedding models

Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG). Examples:
  • BAAI/bge-large-en-v1.5: General-purpose English embeddings
  • michaelfeil/Qwen3-Embedding-8B-auto: Multilingual embeddings with quantization support
  • Salesforce/SFR-Embedding-Mistral: Instruction-tuned embeddings
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-large-en-v1.5"
    quantization_type: no_quant  # fp8 / fp4 quantization also supported for causal models
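To sanity-check a deployment, you can call the OpenAI-compatible endpoint (see OpenAI compatibility below) and compare texts by cosine similarity of their embeddings. A minimal sketch, assuming a placeholder model URL and numpy installed:
from openai import OpenAI
import os
import numpy as np

# Placeholder URL; substitute your deployment's OpenAI-compatible base URL.
client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

resp = client.embeddings.create(
    input=["What is the capital of France?", "Paris is the capital of France."],
    model="not-required"
)

# Cosine similarity between the two returned vectors.
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))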

Reranking models

Reranking models are actually classification models that score document relevance for search and retrieval tasks. They work by classifying query-document pairs as relevant or not relevant. How rerankers work:
  • Rerankers are sequence classification models (ending with ForSequenceClassification)
  • They take a query and document as input and output a relevance score
  • The “reranking” is accomplished by scoring multiple documents and ranking them by the classification score
  • You can implement reranking by using the classification endpoint with proper prompt templates
Recommended:
  • BAAI/bge-reranker-v2-m3: Great reranking model (279M params). Performs well in RAG systems where a first pass of vector retrieval surfaces dozens of snippets of data.
  • michaelfeil/Qwen3-Reranker-8B-seq: Best multilingual and general-purpose reranker. Note: Needs to be used with the webserver_default_route: /predict setting.
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-reranker-v2-m3"
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /rerank
Implementation: Use the /predict endpoint with proper prompt formatting for query-document pairs. The baseten-performance-client handles reranking template formatting automatically.
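A minimal sketch of reranking through the deployed /rerank route using the Performance Client's batch_post. The query/texts payload assumes the TEI-style rerank schema and a response of index/score pairs; verify the exact shape against your deployment:
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

# Assumed TEI-style payload: one query plus the candidate documents to score.
response = client.batch_post(
    route="/rerank",
    payloads=[{
        "query": "What is the capital of France?",
        "texts": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany."
        ]
    }]
)

# Assumed response shape: one {"index": ..., "score": ...} entry per document.
# Sorting by score yields the reranked order.
for result in sorted(response.data[0], key=lambda r: r["score"], reverse=True):
    print(result["index"], result["score"])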

Classification models

Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection. Examples:
  • papluca/xlm-roberta-base-language-detection: Language identification
  • samlowe/roberta-base-go_emotions: Emotion classification
  • Reward Models: RLHF reward model examples
Configuration:
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "papluca/xlm-roberta-base-language-detection"
    quantization_type: no_quant  # BEI-Bert required for classification models
  runtime:
    webserver_default_route: /predict
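A minimal sketch of calling the /predict route with the Performance Client. The single-string payload and the label/score response shape assume the TEI-compatible classification API; adjust to match your deployment:
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

response = client.batch_post(
    route="/predict",
    payloads=[{"inputs": "Ceci est un exemple de texte en français."}]
)

# Assumed response shape: a list of {"label": ..., "score": ...} entries per input.
for prediction in response.data[0]:
    print(prediction["label"], prediction["score"])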

Named entity recognition (NER): BEI-Bert only

NER models classify each token in the input text into entity categories such as person (PER), organization (ORG), location (LOC), and miscellaneous (MISC). NER models use the ForTokenClassification architecture and the /predict_tokens endpoint. NER requires BEI-Bert (base_model: encoder_bert) and is not supported on BEI. Recommended:
  • dslim/bert-base-NER-uncased: Fast, compact NER model for English. (truss example)
  • tanaos/tanaos-NER-v1: General-purpose NER model.
Configuration:
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "baseten-admin/bert-base-ner-uncased"
      revision: main
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /predict_tokens
NER request format:
{
  "inputs": [["Apple is looking at buying U.K. startup for $1 billion"]],
  "truncate": true,
  "raw_scores": false,
  "aggregation_strategy": "max"
}
| Field | Type | Description |
|---|---|---|
| inputs | list of list of strings | Batched text inputs to classify. Each inner list is a batch of texts. |
| raw_scores | boolean | When true, returns raw logit scores for all labels per token. When false, returns only the top predicted label with its probability. |
| truncate | boolean | Truncates inputs that exceed the model’s max sequence length. |
| truncation_direction | string | Controls which end is truncated. Defaults to "Right". |
| aggregation_strategy | string | Merges sub-word tokens into entity spans. Accepts "none", "simple", "first", "average", or "max". Use "max" to match the behavior of transformers.pipeline("ner", aggregation_strategy="max"). Omit or use "none" for token-level predictions. |
Response with aggregation_strategy: "max" (recommended for production):
[
  [
    {"token": "Apple", "token_id": 0, "start": 0, "end": 5, "results": {"ORG": 0.9975586}},
    {"token": "U.K.", "token_id": 0, "start": 27, "end": 31, "results": {"LOC": 0.9980469}}
  ]
]
Response with aggregation_strategy: "none" and raw_scores: true (token-level with BIO labels):
[
  [
    {
      "token": "Apple",
      "token_id": 6207,
      "start": 0,
      "end": 5,
      "results": {
        "B-ORG": 6.7578125,
        "O": -1.7929688,
        "B-LOC": 0.6015625,
        "B-MISC": 0.2467041,
        "B-PER": 0.17675781,
        "I-ORG": -0.6484375,
        "I-MISC": -1.9873047,
        "I-LOC": -1.3808594,
        "I-PER": -2.21875
      }
    }
  ]
]
Token-level labels follow the BIO tagging scheme: B- marks the beginning of an entity, I- marks a continuation, and O means outside any entity.
Python example (using the Baseten Performance Client):
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

response = client.batch_post(
    route="/predict_tokens",
    payloads=[{
        "inputs": [["Apple is looking at buying U.K. startup for $1 billion"]],
        "truncate": True,
        "raw_scores": False,
        "aggregation_strategy": "max"
    }]
)

for entity in response.data[0]:
    label = next(iter(entity["results"]))
    score = entity["results"][label]
    print(f"{entity['token']}: {label} ({score:.4f})")
NER models do not have an OpenAI-compatible endpoint. Use the /predict_tokens route directly. The /predict_tokens route also supports async inference.

Performance and optimization

Throughput benchmarks

For detailed performance benchmarks, see: Run Qwen3 Embedding on NVIDIA Blackwell GPUs
| Framework | Precision | GPU | Max Token/s Throughput | Max Request/s Throughput |
|---|---|---|---|---|
| TEI | FP16 | H100 | 34,055 | 824.25 |
| BEI-Bert | FP16 | H100 | 36,520 | 841.05 |
| vLLM | BF16 | H100 | 36,625 | 155.23 |
| BEI | BF16 | H100 | 47,549 | 761.44 |
| BEI | FP8 | H100 | 77,107 | 855.96 |
| BEI | FP8 | B200 | 121,443 | 1,310.52 |
  • Token Throughput/s: Measured on 500 tokens per request
  • Request Throughput/s: Measured on 5 tokens per request

Quantization impact

| Quantization | Speed Improvement | Memory Reduction | Accuracy Impact |
|---|---|---|---|
| FP16/BF16 vLLM | Baseline | None | None |
| FP16/BF16 BEI | 1.3x | None | None |
| FP8 BEI | 2x faster | 50% | ~1% |
| FP4 BEI | 3.5x faster | 75% | 1-2% |

Hardware requirements

| GPU Type | BEI Support | BEI-Bert Support | Recommended For |
|---|---|---|---|
| L4 | Full | Full | Cost-effective deployments |
| A10G, A100 | Full | Full | Legacy support |
| T4 | No | Full | Legacy support |
| H100 | Full | Full | Maximum performance |
| B200 | Full | Full | FP4 quantization |

OpenAI compatibility

BEI deployments are fully OpenAI compatible for embeddings:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

embedding = client.embeddings.create(
    input=["Baseten Embeddings are fast.", "Embed this sentence!"],
    model="not-required"
)
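Because routing is handled by the base URL, the model parameter can be any placeholder string. The response follows the standard OpenAI embeddings schema, so each vector is available as embedding.data[i].embedding.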

Baseten Performance Client

For maximum throughput, use the Baseten Performance Client.
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

texts = ["Hello world", "Example text", "Another sample"]
response = client.embed(
    input=texts,
    model="my_model",
    batch_size=4,
    max_concurrent_requests=32,
    timeout_s=360
)

Reference config

For complete configuration options, see the BEI reference config.

Key configuration options

trt_llm:
  build:
    base_model: encoder  # or encoder_bert for BEI-Bert
    checkpoint_repository:
      source: HF  # or GCS, S3, AZURE, REMOTE_URL
      repo: "model-repo-name"
      revision: main
      runtime_secret_name: hf_access_token
    max_num_tokens: 16384  # BEI automatically upgrades to 16384
    quantization_type: fp8  # or no_quant for BEI-Bert
  runtime:
    webserver_default_route: /v1/embeddings  # or /rerank, /predict

Production best practices

GPU selection guidelines

  • L4: Best for models <4B parameters, cost-effective
  • H100: Required for models 4B+ parameters or long context (>8K tokens)
  • H100_40GB: Use for models with memory constraints

Build job optimization

# H100 builds (default)
trt_llm:
  build:
    num_builder_gpus: 2

# L4 builds (memory-constrained)
trt_llm:
  build:
    num_builder_gpus: 4

Model-specific recommendations

BERT-based models (BEI-Bert):
  • Use encoder_bert base model
  • No quantization support (FP16/BF16 only)
  • Best for models <200M parameters on L4
ModernBERT and newer architectures:
  • Support longer contexts (up to 8192 tokens)
  • Use H100 for models >1B parameters
  • Consider memory requirements for long sequences
Qwen embedding models:
  • Use regular FP8 quantization
  • Support very long contexts (up to 131K tokens)
  • Higher memory requirements for long sequences

Token limit optimization

trt_llm:
  build:
    max_num_tokens: 16384  # Default, automatically set by BEI
    # Override for specific use cases:
    # max_num_tokens: 8192   # Standard embeddings
    # max_num_tokens: 131072  # Qwen long-context models

Getting started

  1. Choose your variant: BEI for causal models and quantization, BEI-Bert for BERT models
  2. Review configuration: See BEI reference config
  3. Deploy your model: Use the configuration templates and examples
  4. Test integration: Use OpenAI client or Performance Client for maximum throughput

Examples and further reading