- Lowest-latency inference across any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama)
- Highest-throughput inference across any embedding solution (vLLM, SGLang, Infinity, TEI, Ollama) - thanks to XQA kernels, FP8 and dynamic batching.
- High parallelism: up to 1400 client embeddings per second
- Cached model weights for fast vertical scaling and high availability - no Hugging Face hub dependency at runtime
- Ahead-of-time compilation, memory allocation and fp8 post-training quantization
Getting started with embedding models:
Embedding models are LLMs without an lm_head for language generation. Typical architectures supported for embeddings are LlamaModel, BertModel, RobertaModel or Gemma2Model, and the repositories contain the safetensors, config, tokenizer and sentence-transformers config files.
A good example is the repo BAAI/bge-multilingual-gemma2.
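One quick way to verify that a repo uses one of these architectures is to inspect its config.json. A minimal sketch using huggingface_hub (the repo name is the example above; the `architectures` field is standard in Hugging Face configs):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json to check which architecture the repo implements.
path = hf_hub_download(repo_id="BAAI/bge-multilingual-gemma2", filename="config.json")
with open(path) as f:
    architectures = json.load(f).get("architectures", [])
print(architectures)  # e.g. ["Gemma2Model"] -> supported for embeddings
```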
To deploy a model for embeddings, set the following config in a config.yaml in your local directory; with this config.yaml in place, you can deploy the model to Baseten.
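A minimal sketch of such a config.yaml, assuming the Truss TRT-LLM build schema (field names like `trt_llm.build.checkpoint_repository` follow Truss's config reference; verify values against the current docs):

```yaml
model_name: BEI-bge-multilingual-gemma2
resources:
  accelerator: H100  # L4 or newer is required for fp8
trt_llm:
  build:
    base_model: encoder  # embedding models have no lm_head
    checkpoint_repository:
      repo: BAAI/bge-multilingual-gemma2
      source: HF
    quantization_type: fp8  # optional; see the benchmarks section below
```

With the file in place, deploying is typically a single `truss push` from the same directory using the Truss CLI.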
Example deployment of classification and reranking models
Besides embedding models, BEI deploys high-throughput reranking and classification models. You can identify suitable architectures by the ForSequenceClassification suffix in the Hugging Face repo.
Typical use cases for these models are reward modeling, reranking documents in RAG, and tasks like content moderation.
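As an illustration, a rerank request against a deployed model might look like the sketch below. The URL, API-key header, and TEI-style /rerank payload shape are assumptions for this example; check your deployment's endpoint documentation for the exact route and schema.

```python
import os
import requests

# Hypothetical deployment URL; replace with your model's actual endpoint.
BASE_URL = "https://model-xxxxxxx.api.baseten.co/environments/production/sync"
API_KEY = os.environ["BASETEN_API_KEY"]

# TEI-style rerank payload: one query plus candidate documents to score.
resp = requests.post(
    f"{BASE_URL}/rerank",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "query": "What is the capital of France?",
        "texts": ["Berlin is a large city.", "Paris is the capital of France."],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. a list of {"index": ..., "score": ...} entries
```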
Benchmarks and performance optimizations
Embedding models on BEI are fast, and currently offer the fastest implementation for embeddings across all open-source and closed-source providers. The team behind the implementation are the authors of Infinity. We recommend fp8 quantization for Llama, Mistral and Qwen2 models on L4 or newer GPUs (L4, H100, H200 and B200). The quality difference between fp8 and bfloat16 is often negligible: embedding models typically retain >99% cosine similarity between the two precisions, and reranking models retain the ranking order even though the raw output scores differ slightly. For more details, check out the technical launch post.
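To sanity-check the fp8 vs. bfloat16 claim on your own data, you can compare embeddings of the same input from both precisions directly. A minimal sketch (the placeholder vectors stand in for real model outputs from two deployments):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, emb_fp8 and emb_bf16 come from embedding the same text with an
# fp8-quantized and a bfloat16 deployment of the same model; placeholders here.
emb_bf16 = np.random.rand(1024)
emb_fp8 = emb_bf16 + np.random.normal(0, 1e-3, 1024)  # simulated quantization noise
print(f"cosine similarity: {cosine_similarity(emb_fp8, emb_bf16):.4f}")  # expect > 0.99
```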
Deploy custom or fine-tuned models on BEI:
We support the deployment of the below models, as well as all fine-tuned variants of these models (same architecture with customized weights). The following repositories are supported; this list is not exhaustive.

Model Repository | Architecture | Function |
---|---|---|
Salesforce/SFR-Embedding-Mistral | MistralModel | embedding |
BAAI/bge-m3 | BertModel | embedding |
BAAI/bge-multilingual-gemma2 | Gemma2Model | embedding |
mixedbread-ai/mxbai-embed-large-v1 | BertModel | embedding |
BAAI/bge-large-en-v1.5 | BertModel | embedding |
allenai/Llama-3.1-Tulu-3-8B-RM | LlamaForSequenceClassification | classifier |
ncbi/MedCPT-Cross-Encoder | BertForSequenceClassification | reranker/classifier |
SamLowe/roberta-base-go_emotions | RobertaForSequenceClassification | classifier |
mixedbread/mxbai-rerank-large-v2-seq | Qwen2ForSequenceClassification | reranker/classifier |
BAAI/bge-en-icl | LlamaModel | embedding |
BAAI/bge-reranker-v2-m3 | BertForSequenceClassification | reranker/classifier |
Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | LlamaForSequenceClassification | classifier |
Snowflake/snowflake-arctic-embed-l | BertModel | embedding |
nomic-ai/nomic-embed-code | Qwen2Model | embedding |