# Overview

> Production-grade embeddings, reranking, and classification models

Baseten Embeddings Inference (BEI) is Baseten's solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM. Baseten positions BEI as the lowest-latency, highest-throughput option among embedding solutions; see [Throughput benchmarks](#throughput-benchmarks) for measured numbers.

BEI deployments mirror build artifacts to the [Baseten Delivery Network](/development/model/bdn) automatically. No extra configuration is required.

## BEI vs BEI-Bert

BEI comes in two variants, each optimized for different model architectures:

<CardGroup cols={2}>
  <Card title="BEI" href="#bei-features" icon="brain-circuit" iconType="duotone">
    Causal embedding models with quantization support and maximum throughput.
  </Card>

  <Card title="BEI-Bert" href="#bei-bert-features" icon="microchip" iconType="duotone">
    BERT-based models with cold-start optimization, 16-bit precision and bidirectional attention.
  </Card>
</CardGroup>

### BEI features

**Use BEI when:**

* Your model uses a causal architecture (Llama, Mistral, or Qwen for embeddings)
* You need quantization support (FP8, FP4)
* You require maximum throughput
* You're deploying models such as BAAI/bge, Qwen3-Embedding, or Salesforce/SFR-Embedding

**Benefits:**

* **Quantization Support**: FP8 and FP4 quantization for 2-4x speedup
* **Highest Throughput**: Up to 1400 client embeddings per second
* **XQA Kernels**: Optimized attention kernels for maximum performance
* **Dynamic Batching**: Automatic batch optimization for varying loads

**Supported Architectures:**

* `LlamaModel` (for example, BAAI/bge-multilingual-gemma2)
* `MistralModel` (for example, Salesforce/SFR-Embedding-Mistral)
* `Qwen2Model` (for example, Qwen/Qwen3-Embedding-8B)
* `Gemma2Model` (for example, Google/EmbeddingGemma)

### BEI-Bert features

**Use BEI-Bert when:**

* Your model uses a BERT-based architecture (sentence-transformers, jinaai, nomic-ai) or another bidirectional attention architecture
* You need cold-start optimization for small models (`<4B` parameters)
* 16-bit precision is sufficient for your use case
* You're deploying model architectures such as Jina-BERT, Nomic, or ModernBERT

**Benefits:**

* **Cold-Start Optimization**: Optimized for fast initialization and small models
* **16-bit Precision**: Models run in FP16 precision
* **BERT Architecture Support**: Specialized optimization for bidirectional models
* **Low Memory Footprint**: Efficient for smaller models and edge deployments

**Supported Architectures:**

* `BertModel` (for example, sentence-transformers/all-MiniLM-L6-v2)
* `RobertaModel` (for example, FacebookAI/roberta-base)
* `Jina-BERT` (for example, jinaai/jina-embeddings-v2-base-en)
* `Nomic-BERT` (for example, nomic-ai/nomic-embed-text-v1.5)
* `Alibaba-GTE` (for example, Alibaba-NLP/gte-large-en-v1.5)
* `Llama Bidirectional` (for example, nvidia/llama-embed-nemotron-8b)

## Model types and use cases

### Embedding models

Embedding models convert text into numerical representations for semantic search, clustering, and retrieval-augmented generation (RAG).

**Examples:**

* **BAAI/bge-large-en-v1.5**: General-purpose English embeddings
* **michaelfeil/Qwen3-Embedding-8B-auto**: Multilingual embeddings with quantization support
* **Salesforce/SFR-Embedding-Mistral**: Instruction-tuned embeddings

**Configuration:**

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-large-en-v1.5"
    quantization_type: no_quant  # Quantization (for example, fp8) is supported for causal models only
```

### Reranking models

Reranking models are sequence classification models that score how relevant a document is to a query. They classify each query-document pair as relevant or not relevant, and the resulting scores are used to order search and retrieval results.

**How rerankers work:**

* Rerankers are sequence classification models (ending with `ForSequenceClassification`)
* They take a query and document as input and output a relevance score
* The "reranking" is accomplished by scoring multiple documents and ranking them by the classification score
* You can implement reranking by using the classification endpoint with proper prompt templates
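The score-then-sort step described above can be sketched in a few lines. This is a minimal illustration rather than Baseten's implementation: `rank_documents` is a hypothetical helper, and the scores are made-up values standing in for what a classification endpoint would return for each query-document pair.

```python
from typing import List, Tuple


def rank_documents(
    documents: List[str], scores: List[float], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Sort documents by relevance score, highest first, and keep the top_k."""
    if len(documents) != len(scores):
        raise ValueError("expected exactly one score per document")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical scores, one per query-document pair.
documents = ["doc about cats", "doc about GPUs", "doc about embeddings"]
scores = [0.02, 0.41, 0.93]

top_documents = rank_documents(documents, scores, top_k=2)
for text, score in top_documents:
    print(f"{score:.2f}  {text}")
```

In a RAG pipeline, the `documents` list would typically be the candidates returned by a first-pass vector search.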

**Recommended:**

* **BAAI/bge-reranker-v2-m3**: Great reranking model (279M params). Performs well in RAG systems where a first pass of vector retrieval surfaces dozens of snippets of data.
* **michaelfeil/Qwen3-Reranker-8B-seq**: Best multilingual and general-purpose reranker. **Note:** Needs to be used with the `webserver_default_route: /predict` setting.

**Configuration:**

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      source: HF
      repo: "BAAI/bge-reranker-v2-m3"
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /rerank
```

**Implementation:**
Use the `/predict` endpoint with prompt formatting suited to query-document pairs. The [Baseten Performance Client](/inference/performance-client) handles reranking template formatting automatically.

### Classification models

Classification models categorize text into predefined classes for tasks like sentiment analysis, content moderation, and language detection.

**Examples:**

* **papluca/xlm-roberta-base-language-detection**: Language identification
* **samlowe/roberta-base-go\_emotions**: Emotion classification
* **Reward Models**: RLHF reward model examples

**Configuration:**

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder_bert  # RoBERTa-based classifier, so BEI-Bert
    checkpoint_repository:
      source: HF
      repo: "papluca/xlm-roberta-base-language-detection"
    quantization_type: no_quant  # BEI-Bert runs in 16-bit precision only
  runtime:
    webserver_default_route: /predict
```

### Named entity recognition (NER): BEI-Bert only

NER models classify each token in the input text into entity categories such as person (`PER`), organization (`ORG`), location (`LOC`), and miscellaneous (`MISC`). NER models use the `ForTokenClassification` architecture and the `/predict_tokens` endpoint. NER requires **BEI-Bert** (`base_model: encoder_bert`) and is not supported on BEI.

**Recommended:**

* **dslim/bert-base-NER-uncased**: Fast, compact NER model for English. ([truss example](https://github.com/basetenlabs/truss-examples/tree/main/custom-server/BEI-Bert-dslim-bert-base-ner-uncased))
* **tanaos/tanaos-NER-v1**: General-purpose NER model.

**Configuration:**

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: HF
      repo: "baseten-admin/bert-base-ner-uncased"
      revision: main
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /predict_tokens
```

**NER request format:**

```json theme={"system"}
{
  "inputs": [["Apple is looking at buying U.K. startup for $1 billion"]],
  "truncate": true,
  "raw_scores": false,
  "aggregation_strategy": "max"
}
```

| Field                  | Type                    | Description                                                                                                                                                                                                                                                    |
| ---------------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `inputs`               | list of list of strings | Batched text inputs to classify. Each inner list is a batch of texts.                                                                                                                                                                                          |
| `raw_scores`           | boolean                 | When `true`, returns raw logit scores for all labels per token. When `false`, returns only the top predicted label with its probability.                                                                                                                       |
| `truncate`             | boolean                 | Truncates inputs that exceed the model's max sequence length.                                                                                                                                                                                                  |
| `truncation_direction` | string                  | Controls which end is truncated. Defaults to `"Right"`.                                                                                                                                                                                                        |
| `aggregation_strategy` | string                  | Merges sub-word tokens into entity spans. Accepts `"none"`, `"simple"`, `"first"`, `"average"`, or `"max"`. Use `"max"` to match the behavior of `transformers.pipeline("ner", aggregation_strategy="max")`. Omit or use `"none"` for token-level predictions. |
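The fields in the table can be assembled programmatically. The sketch below is one way to do that, assuming a single batch of texts per request; `build_ner_payload` is a hypothetical helper, not part of any Baseten library, and it leaves out `truncation_direction` so the server default applies.

```python
from typing import Any, Dict, List

# Accepted values, as listed in the table above.
AGGREGATION_STRATEGIES = {"none", "simple", "first", "average", "max"}


def build_ner_payload(
    texts: List[str],
    aggregation_strategy: str = "max",
    raw_scores: bool = False,
    truncate: bool = True,
) -> Dict[str, Any]:
    """Build a /predict_tokens request body for a single batch of texts."""
    if aggregation_strategy not in AGGREGATION_STRATEGIES:
        raise ValueError(f"unsupported aggregation_strategy: {aggregation_strategy!r}")
    return {
        "inputs": [texts],  # list of batches; this helper sends one batch
        "truncate": truncate,
        "raw_scores": raw_scores,
        "aggregation_strategy": aggregation_strategy,
    }


payload = build_ner_payload(["Apple is looking at buying U.K. startup for $1 billion"])
print(payload)
```

Validating `aggregation_strategy` on the client side catches typos before a request is sent.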

**Response with `aggregation_strategy: "max"`** (recommended for production):

```json theme={"system"}
[
  [
    {"token": "Apple", "token_id": 0, "start": 0, "end": 5, "results": {"ORG": 0.9975586}},
    {"token": "U.K.", "token_id": 0, "start": 27, "end": 31, "results": {"LOC": 0.9980469}}
  ]
]
```

**Response with `aggregation_strategy: "none"` and `raw_scores: true`** (token-level with BIO labels):

```json theme={"system"}
[
  [
    {
      "token": "Apple",
      "token_id": 6207,
      "start": 0,
      "end": 5,
      "results": {
        "B-ORG": 6.7578125,
        "O": -1.7929688,
        "B-LOC": 0.6015625,
        "B-MISC": 0.2467041,
        "B-PER": 0.17675781,
        "I-ORG": -0.6484375,
        "I-MISC": -1.9873047,
        "I-LOC": -1.3808594,
        "I-PER": -2.21875
      }
    }
  ]
]
```

Token-level labels follow the [BIO tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_\(tagging\)): `B-` marks the beginning of an entity, `I-` marks a continuation, and `O` means outside any entity.
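To make the BIO scheme concrete, the sketch below merges token-level labels into entity spans. It illustrates the tagging scheme only and is not the server's aggregation logic: `merge_bio` is a hypothetical helper, and the `label` field is a simplified input shape (one already-chosen label per token) rather than the response format.

```python
from typing import Dict, List, Tuple


def merge_bio(tokens: List[Dict]) -> List[Tuple[str, int, int]]:
    """Merge per-token BIO labels into (entity_type, start, end) spans."""
    spans: List[Tuple[str, int, int]] = []
    current = None  # [entity_type, start, end] of the span being built
    for token in tokens:
        prefix, _, entity_type = token["label"].partition("-")
        if prefix == "I" and current and current[0] == entity_type:
            current[2] = token["end"]  # continuation: extend the open span
            continue
        if current:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            # "B-" opens a span; a stray "I-" is treated as an opener too.
            current = [entity_type, token["start"], token["end"]]
    if current:
        spans.append(tuple(current))
    return spans


# Hypothetical labels for "Angela Merkel visited New York".
tokens = [
    {"label": "B-PER", "start": 0, "end": 6},
    {"label": "I-PER", "start": 7, "end": 13},
    {"label": "O", "start": 14, "end": 21},
    {"label": "B-LOC", "start": 22, "end": 25},
    {"label": "I-LOC", "start": 26, "end": 30},
]

entity_spans = merge_bio(tokens)
print(entity_spans)
```

In practice, setting `aggregation_strategy` on the request does this merging server-side, so a client-side pass like this is only needed when working with token-level output.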

**Python example (using Baseten Performance Client):**

```python theme={"system"}
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

response = client.batch_post(
    route="/predict_tokens",
    payloads=[{
        "inputs": [["Apple is looking at buying U.K. startup for $1 billion"]],
        "truncate": True,
        "raw_scores": False,
        "aggregation_strategy": "max"
    }]
)

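# response.data holds one parsed response per payload. Per the response format
# above, each response is a list with one list of entities per input batch.
for batch in response.data[0]:
    for entity in batch:
        label = next(iter(entity["results"]))
        score = entity["results"][label]
        print(f"{entity['token']}: {label} ({score:.4f})")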
```

NER models do not have an OpenAI-compatible endpoint. Use the `/predict_tokens` route directly. The `/predict_tokens` route also supports [async inference](/inference/async).

## Performance and optimization

### Throughput benchmarks

For detailed performance benchmarks, see: [Run Qwen3 Embedding on NVIDIA Blackwell GPUs](https://www.baseten.co/blog/run-qwen3-embedding-on-nvidia-blackwell-gpus/#bei-provides-the-fastest-embeddings-inference-on-b200s)

| Framework | Precision | GPU  | Max Token/s Throughput | Max Request/s Throughput |
| --------- | --------- | ---- | ---------------------- | ------------------------ |
| TEI       | FP16      | H100 | 34,055                 | 824.25                   |
| BEI-Bert  | FP16      | H100 | 36,520                 | 841.05                   |
| vLLM      | BF16      | H100 | 36,625                 | 155.23                   |
| BEI       | BF16      | H100 | 47,549                 | 761.44                   |
| BEI       | FP8       | H100 | 77,107                 | 855.96                   |
| BEI       | FP8       | B200 | 121,443                | 1,310.52                 |

* **Max Token/s Throughput**: Measured with 500 tokens per request
* **Max Request/s Throughput**: Measured with 5 tokens per request
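The table is easier to compare once the numbers are expressed relative to a baseline. The sketch below does that arithmetic for the H100 rows, using vLLM BF16 as the baseline; the helper names are hypothetical and the figures are copied from the table, not measured independently.

```python
# Max token/s throughput on H100, copied from the table above.
TOKENS_PER_SECOND = {
    "TEI FP16": 34_055,
    "BEI-Bert FP16": 36_520,
    "vLLM BF16": 36_625,
    "BEI BF16": 47_549,
    "BEI FP8": 77_107,
}
TOKENS_PER_REQUEST = 500  # request size used for the token throughput column


def speedup(framework: str, baseline: str = "vLLM BF16") -> float:
    """Token throughput of `framework` relative to `baseline`."""
    return TOKENS_PER_SECOND[framework] / TOKENS_PER_SECOND[baseline]


def implied_requests_per_second(framework: str) -> float:
    """Requests/s implied by the token throughput at 500 tokens per request."""
    return TOKENS_PER_SECOND[framework] / TOKENS_PER_REQUEST


for name in TOKENS_PER_SECOND:
    print(f"{name}: {speedup(name):.2f}x, {implied_requests_per_second(name):.1f} req/s")
```

The implied request rate for 500-token requests is far lower than the request throughput column, which is expected because that column was measured with 5-token requests.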

### Quantization impact

| **Quantization** | **Speed Improvement** | **Memory Reduction** | **Accuracy Impact** |
| ---------------- | --------------------- | -------------------- | ------------------- |
| FP16/BF16 vLLM   | Baseline              | None                 | None                |
| FP16/BF16 BEI    | 1.3x faster           | None                 | None                |
| FP8 BEI          | 2x faster             | 50%                  | \~1%                |
| FP4 BEI          | 3.5x faster           | 75%                  | 1-2%                |
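The memory reduction column follows from the number of bytes stored per parameter. The sketch below works through that arithmetic for an 8B-parameter model. It is a rough estimate of weight memory only: it ignores activations, quantization scale factors, and runtime overhead, and the helper names are hypothetical.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAMETER = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(num_parameters: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9


def reduction_vs_bf16(precision: str) -> float:
    """Fraction of weight memory saved relative to BF16."""
    return 1.0 - BYTES_PER_PARAMETER[precision] / BYTES_PER_PARAMETER["bf16"]


for precision in ("bf16", "fp8", "fp4"):
    gb = weight_memory_gb(8e9, precision)
    print(f"{precision}: {gb:.0f} GB, {reduction_vs_bf16(precision):.0%} smaller")
```

Actual savings on a deployed model will be somewhat smaller than this estimate, since not every tensor is necessarily quantized.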

### Hardware requirements

| **GPU Type** | **BEI Support** | **BEI-Bert Support** | **Recommended For**        |
| ------------ | --------------- | -------------------- | -------------------------- |
| L4           | Full            | Full                 | Cost-effective deployments |
| A10G, A100   | Full            | Full                 | Legacy support             |
| T4           | No              | Full                 | Legacy support             |
| H100         | Full            | Full                 | Maximum performance        |
| B200         | Full            | Full                 | FP4 quantization           |

## OpenAI compatibility

BEI deployments are fully OpenAI-compatible for embeddings:

```python theme={"system"}
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

embedding = client.embeddings.create(
    input=["Baseten Embeddings are fast.", "Embed this sentence!"],
    model="not-required"
)
```
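Once embeddings come back, a common next step is comparing them. The sketch below computes cosine similarity in plain Python. The toy vectors stand in for `embedding.data[0].embedding` and `embedding.data[1].embedding` from the OpenAI client response, and `cosine_similarity` is a hypothetical helper, not part of either client library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
first = [0.1, 0.3, 0.5]
second = [0.2, 0.6, 1.0]

similarity = cosine_similarity(first, second)
print(f"{similarity:.3f}")
```

For large collections, a vectorized library or a vector database is a better fit than a pure-Python loop.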

### Baseten Performance Client

For maximum throughput, use the [Baseten Performance Client](/inference/performance-client).

```python theme={"system"}
from baseten_performance_client import PerformanceClient
import os

client = PerformanceClient(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync"
)

texts = ["Hello world", "Example text", "Another sample"]
response = client.embed(
    input=texts,
    model="my_model",
    batch_size=4,
    max_concurrent_requests=32,
    timeout_s=360
)
```

## Reference config

For complete configuration options, see the [BEI reference config](/engines/bei/bei-reference).

### Key configuration options

```yaml theme={"system"}
trt_llm:
  build:
    base_model: encoder  # or encoder_bert for BEI-Bert
    checkpoint_repository:
      source: HF  # or GCS, S3, AZURE, REMOTE_URL
      repo: "model-repo-name"
      revision: main
      runtime_secret_name: hf_access_token
    max_num_tokens: 16384  # BEI automatically upgrades to 16384
    quantization_type: fp8  # or no_quant for BEI-Bert
  runtime:
    webserver_default_route: /v1/embeddings  # or /rerank, /predict
```

## Production best practices

### GPU selection guidelines

* **L4**: Best for models `<4B` parameters, cost-effective
* **H100**: Required for models 4B+ parameters or long context (>8K tokens)
* **H100\_40GB**: Use for models with memory constraints
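The L4 and H100 guidelines above can be written as a small rule of thumb. This is a sketch under stated assumptions: `recommend_gpu` is a hypothetical helper, "8K tokens" is read as 8192, and the H100\_40GB case is left out because the guideline for it is not numeric.

```python
def recommend_gpu(num_parameters: float, max_context_tokens: int = 512) -> str:
    """Apply the L4/H100 rules of thumb from the list above."""
    # H100 for models with 4B+ parameters or contexts longer than 8K tokens.
    if num_parameters >= 4e9 or max_context_tokens > 8192:
        return "H100"
    # Otherwise the cost-effective L4 is the suggested starting point.
    return "L4"


print(recommend_gpu(335e6))          # small BERT-sized model
print(recommend_gpu(8e9))            # 8B-parameter embedding model
print(recommend_gpu(600e6, 32_768))  # small model, long context
```

Treat the result as a starting point and confirm with a test deployment, since batch size and sequence length also drive memory use.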

### Build job optimization

```yaml theme={"system"}
# H100 builds (default)
trt_llm:
  build:
    num_builder_gpus: 2

# L4 builds (memory-constrained): use this instead of the H100 snippet above, not alongside it
trt_llm:
  build:
    num_builder_gpus: 4
```

### Model-specific recommendations

**BERT-based models (BEI-Bert):**

* Use `encoder_bert` base model
* No quantization support (FP16/BF16 only)
* Best for models `<200M` parameters on L4

**ModernBERT and newer architectures:**

* Support longer contexts (up to 8192 tokens)
* Use H100 for models >1B parameters
* Consider memory requirements for long sequences

**Qwen embedding models:**

* Use regular FP8 quantization
* Support very long contexts (up to 131K tokens)
* Higher memory requirements for long sequences

### Token limit optimization

```yaml theme={"system"}
trt_llm:
  build:
    max_num_tokens: 16384  # Default, automatically set by BEI
    # Override for specific use cases:
    # max_num_tokens: 8192   # Standard embeddings
    # max_num_tokens: 131072  # Qwen long-context models
```

## Getting started

1. **Choose your variant**: BEI for causal models and quantization, BEI-Bert for BERT models
2. **Review configuration**: See [BEI reference config](/engines/bei/bei-reference)
3. **Deploy your model**: Use the configuration templates and examples
4. **Test integration**: Use the OpenAI client, or the Performance Client for maximum throughput

## Examples and further reading

* [BEI-Bert examples](/engines/bei/bei-bert) - BERT-specific configurations
* [BEI reference config](/engines/bei/bei-reference) - Complete configuration options
* [Embedding examples](/examples/bei) - Concrete deployment examples
* [Performance client documentation](/inference/performance-client) - Client usage with embeddings
