Baseten Embeddings Inference (BEI) is Baseten's solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM.

With Baseten Embeddings Inference you get the following benefits:

  • Lowest-latency inference across any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama)¹
  • Highest-throughput inference across any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama), thanks to XQA kernels, FP8, and dynamic batching²
  • High parallelism: up to 1400 client embeddings per second
  • Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
  • Ahead-of-time compilation, memory allocation and FP8 post-training quantization

Getting started with embedding models:

Embedding models are LLMs without an lm_head for language generation. Typical supported architectures are LlamaModel, BertModel, RobertaModel, and Gemma2Model; the checkpoint repository must contain the safetensors weights plus the config, tokenizer, and sentence-transformers config files. A good example is the repo BAAI/bge-multilingual-gemma2.
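If you are unsure whether a checkpoint qualifies, you can check the architecture declared in its config.json. A minimal sketch using huggingface_hub (a local check only, not required for deployment):

import json

from huggingface_hub import hf_hub_download

# download only config.json and print the declared architecture
config_path = hf_hub_download("BAAI/bge-multilingual-gemma2", "config.json")
with open(config_path) as f:
    print(json.load(f)["architectures"])  # e.g. ["Gemma2Model"]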

To deploy a model for embeddings, save the following config in your local directory.

config.yaml
model_name: BEI-bge-en-icl-fp8
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      # for a different model, change the repo, e.g. to "Salesforce/SFR-Embedding-Mistral"
      # or "Linq-AI-Research/Linq-Embed-Mistral"
      repo: "BAAI/bge-en-icl"
      revision: main
      source: HF
    # only Llama, Mistral, and Qwen models support quantization
    quantization_type: fp8

With config.yaml in your local directory, you can deploy the model to Baseten:

truss push --publish --promote

Deployed embedding models are OpenAI compatible without any additional settings. You may use the client code below to consume the model.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    # add the deployment URL
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# the `model` argument is required by the SDK but not used to route requests,
# since the deployment serves a single model
embedding = client.embeddings.create(
    input=["Baseten Embeddings are fast.", "Embed this sentence!"],
    model="not-required"
)
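
The response follows the OpenAI embeddings schema, so you get one vector per input string:

# one list of floats per input; its length is the model's hidden size
vectors = [item.embedding for item in embedding.data]
print(len(vectors), len(vectors[0]))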

Example deployment of reranking and classification models

Besides embedding models, BEI serves high-throughput reranking and classification models. You can identify suitable architectures by the ForSequenceClassification suffix in the Hugging Face repo. Typical use cases are reward modeling, reranking documents in RAG pipelines, and tasks like content moderation.

config.yaml
model_name: BEI-mixedbread-rerank-large-v2-fp8
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/mxbai-rerank-large-v2-seq
      revision: main
      source: HF
    # only Llama, Mistral, and Qwen models support quantization
    quantization_type: fp8

As OpenAI does not offer reranking or classification endpoints, we send a plain HTTP request to the model's predict endpoint. Depending on the model, you may need to apply a model-specific prompt template first.

import requests
import os

headers = {
    f"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"
}

# model specific prompt for mixedbread's reranker v2.
prompt = (
  "<|endoftext|><|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n<|im_end|>\n<|im_start|>user\n"
  "query: {query} \ndocument: {doc} \nYou are a search relevance expert who evaluates how well documents match search queries. For each query-document pair, carefully analyze the semantic relationship between them, then provide your binary relevance judgment (0 for not relevant, 1 for relevant).\nRelevance:<|im_end|>\n<|im_start|>assistant\n"
).format(query="What is Baseten?",doc="Baseten is a fast inference provider.")

response = requests.post(
    url="https://model-xxxxxx.api.baseten.co/environments/production/sync/predict",
    headers=headers,
    json={
        "inputs": prompt,
        "raw_scores": True,
    },
)
print(response.json())

Benchmarks and performance optimizations

Embedding models on BEI are fast, currently offering the fastest implementation for embeddings across all open-source and closed-source providers. The team behind the implementation are the authors of Infinity. We recommend fp8 quantization for Llama, Mistral, and Qwen2 models on L4 or newer GPUs (L4, H100, H200 and B200). The quality difference between fp8 and bfloat16 is usually negligible: embedding models typically retain >99% cosine similarity between the two precisions, and reranking models retain the ranking order even when individual scores differ slightly.
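
To quantify the difference on your own data, you can embed the same input with an fp8 and a bfloat16 deployment and compare the vectors. A minimal sketch, assuming two such deployments exist (the base URLs below are placeholders):

import os

import numpy as np
from openai import OpenAI

# placeholder URLs: one fp8 and one bfloat16 deployment of the same model
urls = {
    "fp8": "https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",
    "bf16": "https://model-yyyyyy.api.baseten.co/environments/production/sync/v1",
}

vectors = {}
for name, url in urls.items():
    client = OpenAI(api_key=os.environ["BASETEN_API_KEY"], base_url=url)
    response = client.embeddings.create(
        input=["Baseten Embeddings are fast."], model="not-required"
    )
    vectors[name] = np.array(response.data[0].embedding)

a, b = vectors["fp8"], vectors["bf16"]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # typically > 0.99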

The team at Baseten offers additional options for sharing cached model weights across replicas, enabling very fast horizontal scaling. Please contact us to enable this option.

Deploy custom or fine-tuned models on BEI:

We support the deployment of the models below, as well as all fine-tuned variants of these models (same architecture, customized weights). The following repositories are supported; the list is not exhaustive. A config sketch for deploying a fine-tuned variant follows the table.

| Model Repository | Architecture | Function |
| --- | --- | --- |
| Salesforce/SFR-Embedding-Mistral | MistralModel | embedding |
| BAAI/bge-m3 | BertModel | embedding |
| BAAI/bge-multilingual-gemma2 | Gemma2Model | embedding |
| mixedbread-ai/mxbai-embed-large-v1 | BertModel | embedding |
| BAAI/bge-large-en-v1.5 | BertModel | embedding |
| allenai/Llama-3.1-Tulu-3-8B-RM | LlamaForSequenceClassification | classifier |
| ncbi/MedCPT-Cross-Encoder | BertForSequenceClassification | reranker/classifier |
| SamLowe/roberta-base-go_emotions | XLMRobertaForSequenceClassification | classifier |
| mixedbread/mxbai-rerank-large-v2-seq | Qwen2ForSequenceClassification | reranker/classifier |
| BAAI/bge-en-icl | LlamaModel | embedding |
| BAAI/bge-reranker-v2-m3 | BertForSequenceClassification | reranker/classifier |
| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | LlamaForSequenceClassification | classifier |
| Snowflake/snowflake-arctic-embed-l | BertModel | embedding |
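
The config for a fine-tuned variant is identical to the examples above; only the checkpoint repository changes. A sketch, where "your-org/your-finetuned-bge-en-icl" is a placeholder for your own Hugging Face repo:

config.yaml
model_name: BEI-my-finetuned-embedding-model
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      # placeholder: a fine-tune of BAAI/bge-en-icl, i.e. the same LlamaModel architecture
      repo: your-org/your-finetuned-bge-en-icl
      revision: main
      source: HF
    quantization_type: fp8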

¹ Measured on H100-HBM3 (bert-large-335M; for BAAI/bge-en-icl: 9 ms).
² Measured on H100-HBM3 (leading model architecture on MTEB, MistralModel-7B).