# Embeddings with BEI
Serve embedding, reranking, and classification models
Baseten Embeddings Inference (BEI) is Baseten's solution for production-grade inference on embedding, classification, and reranking models, built on TensorRT-LLM.
With Baseten Embeddings Inference you get the following benefits:
- Lowest-latency inference of any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama)¹
- Highest-throughput inference of any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama), thanks to XQA kernels, FP8, and dynamic batching²
- High parallelism: up to 1400 client embeddings per second
- Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
- Ahead-of-time compilation, memory allocation, and FP8 post-training quantization
## Getting started with embedding models
Embedding models are LLMs without an `lm_head` for language generation.
Typical supported architectures for embeddings are `LlamaModel`, `BertModel`, `RobertaModel`, or `Gemma2Model`; their repositories contain the safetensors weights, config, tokenizer, and sentence-transformers config files. A good example is the repo `BAAI/bge-multilingual-gemma2`.
To deploy a model for embeddings, create the following `config.yaml` in your local directory.
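The exact schema depends on your Truss version; a minimal sketch, assuming the `trt_llm` encoder build flow (field names are illustrative, so verify against the current Baseten docs):

```yaml
# Illustrative config.yaml sketch for a BEI embedding deployment.
model_name: BEI-bge-multilingual-gemma2
resources:
  accelerator: H100
trt_llm:
  build:
    base_model: encoder # encoder-style model without an lm_head
    checkpoint_repository:
      repo: BAAI/bge-multilingual-gemma2
      source: HF
    quantization_type: fp8 # recommended on L4, H100, H200, and B200
```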
With `config.yaml` in your local directory, you can deploy the model to Baseten with `truss push`.
Deployed embedding models are OpenAI-compatible without any additional settings. You can use the client code below to consume the model.
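A minimal sketch using the official `openai` Python client; the base URL and API key are placeholders to replace with your deployment's values:

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    # Placeholder endpoint; copy the real URL from your Baseten deployment.
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="BAAI/bge-multilingual-gemma2",
    input=["Baseten Embeddings Inference serves encoder models."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector
```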
## Example deployment of reranking and classification models
Besides embedding models, BEI serves high-throughput reranking and classification models.
You can identify suitable architectures by the `ForSequenceClassification` suffix in the Hugging Face repo.
Typical use cases for these models are reward modeling, reranking documents in RAG pipelines, and tasks like content moderation.
Because the OpenAI API does not offer reranking or classification, we send a plain HTTP request to the endpoint instead. Depending on the model, you may want to apply a model-specific prompt template first.
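A minimal sketch using `requests`, assuming a TEI-style `/rerank` route; the endpoint and payload shape are illustrative, so check your deployment's schema:

```python
import requests  # pip install requests

# Placeholder endpoint; copy the real URL from your Baseten deployment.
base_url = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync"
headers = {"Authorization": "Api-Key YOUR_BASETEN_API_KEY"}

payload = {
    "query": "What is the capital of France?",
    "texts": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
}
response = requests.post(f"{base_url}/rerank", headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # expected: a relevance score per document
```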
## Benchmarks and performance optimizations
Embedding models on BEI are fast, currently offering the fastest embeddings implementation across open-source and closed-source providers. The team behind the implementation are the authors of Infinity. We recommend FP8 quantization for Llama, Mistral, and Qwen2 models on L4 or newer GPUs (L4, H100, H200, and B200). The quality difference between FP8 and bfloat16 is often negligible: embedding models typically retain >99% cosine similarity between the two precisions, and reranking models retain the ranking order despite small numerical differences in the raw outputs.
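To verify the precision trade-off for your own model, embed the same inputs against an FP8 and a bfloat16 deployment and compare the vectors. A minimal sketch with illustrative values (in practice, the two vectors come from your two deployments):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative values only; use real embeddings of the same input text
# from an FP8 deployment and a bfloat16 deployment of the same model.
emb_fp8 = [0.0121, -0.0334, 0.0876, 0.0452]
emb_bf16 = [0.0119, -0.0331, 0.0881, 0.0449]
print(f"cosine similarity: {cosine_similarity(emb_fp8, emb_bf16):.4f}")
```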
The team at Baseten has additional options for sharing cached model weights across replicas, enabling very fast horizontal scaling. Please contact us to enable this option.
## Deploy custom or fine-tuned models on BEI
We support the deployment of the models below, as well as all fine-tuned variants of these models (same architecture, customized weights). The following repositories are supported; this list is not exhaustive.
| Model Repository | Architecture | Function |
|---|---|---|
| Salesforce/SFR-Embedding-Mistral | MistralModel | embedding |
| BAAI/bge-m3 | BertModel | embedding |
| BAAI/bge-multilingual-gemma2 | Gemma2Model | embedding |
| mixedbread-ai/mxbai-embed-large-v1 | BertModel | embedding |
| BAAI/bge-large-en-v1.5 | BertModel | embedding |
| allenai/Llama-3.1-Tulu-3-8B-RM | LlamaForSequenceClassification | classifier |
| ncbi/MedCPT-Cross-Encoder | BertForSequenceClassification | reranker/classifier |
| SamLowe/roberta-base-go_emotions | RobertaForSequenceClassification | classifier |
| mixedbread/mxbai-rerank-large-v2-seq | Qwen2ForSequenceClassification | reranker/classifier |
| BAAI/bge-en-icl | LlamaModel | embedding |
| BAAI/bge-reranker-v2-m3 | BertForSequenceClassification | reranker/classifier |
| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | LlamaForSequenceClassification | classifier |
| Snowflake/snowflake-arctic-embed-l | BertModel | embedding |
¹ Measured on H100-HBM3 (bert-large-335M; for BAAI/bge-en-icl: 9 ms).
² Measured on H100-HBM3 (leading model architecture on MTEB, MistralModel-7B).