Serve embedding, reranking, and classification models
Baseten Embeddings Inference (BEI) is Baseten’s solution for production-grade inference on embedding, classification, and reranking models using TensorRT-LLM.
With Baseten Embeddings Inference you get high-throughput, low-latency inference and OpenAI-compatible endpoints out of the box.
Embedding models are LLMs without an `lm_head` for language generation. Typical supported architectures for embeddings are `LlamaModel`, `BertModel`, `RobertaModel`, or `Gemma2Model`; a compatible repository contains the safetensors weights plus the config, tokenizer, and sentence-transformers config files.
A good example is the repo BAAI/bge-multilingual-gemma2.
To deploy a model for embeddings, create a `config.yaml` in your local directory.
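A minimal sketch of such a `config.yaml`, assuming the Truss TRT-LLM build schema (`base_model: encoder` plus a Hugging Face checkpoint repository); the exact keys may differ across Truss versions, so verify against the current config reference:

```yaml
# Sketch of a BEI deployment config (Truss config.yaml).
# Key names assume the Truss TRT-LLM build schema; verify before deploying.
model_name: bge-multilingual-gemma2-embeddings
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: encoder  # embedding/encoder-style engine
    checkpoint_repository:
      repo: BAAI/bge-multilingual-gemma2
      source: HF
```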
With `config.yaml` in your local directory, you can deploy the model to Baseten, for example with `truss push`.
Deployed embedding models are OpenAI-compatible without any additional settings. You can use the client code below to consume the model.
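A minimal sketch using the OpenAI Python SDK; the base URL, API key, and model name below are placeholders for your deployment’s values:

```python
# Embeddings request against a BEI deployment via the OpenAI SDK
# (pip install openai). URL and key are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="my_model",  # placeholder; BEI serves one model per deployment
    input=["Baseten Embeddings Inference is fast"],
)
print(len(response.data[0].embedding))  # embedding dimensionality
```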
Besides embedding models, BEI deploys high-throughput reranking and classification models. You can identify suitable architectures by the `ForSequenceClassification` suffix in their Hugging Face repositories.
Typical use cases for these models are reward modeling, reranking documents in RAG pipelines, and tasks like content moderation.
As OpenAI does not offer reranking or classification endpoints, we send a plain HTTP request to the deployment instead. Depending on the model, you may need to apply a model-specific prompt template to the inputs first.
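A sketch using `requests`, assuming a TEI-style `/rerank` route on the deployment’s sync endpoint; the route and payload schema are assumptions and may vary by model type:

```python
# Rerank request against a BEI deployment. URL, route, and payload
# shape are assumptions (TEI-style schema); adapt to your deployment.
import requests

resp = requests.post(
    "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/rerank",
    headers={"Authorization": "Api-Key YOUR_BASETEN_API_KEY"},
    json={
        "query": "Which document is about BEI?",
        "texts": [
            "BEI serves embedding, reranking and classification models.",
            "Bananas are rich in potassium.",
        ],
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. a list of {"index": ..., "score": ...} entries
```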
Embedding models on BEI are fast, currently the fastest implementation for embeddings across open-source and closed-source providers. The team behind the implementation are the authors of infinity. We recommend fp8 quantization for Llama, Mistral, and Qwen2 models on L4 or newer GPUs (L4, H100, H200, and B200). The quality difference between fp8 and bfloat16 is usually negligible: embedding models typically retain >99% cosine similarity between the two precisions, and reranking models preserve the ranking order despite small numerical differences in the raw outputs. For more details, check out the technical launch post.
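If your Truss version exposes a quantization field on the TRT-LLM build (an assumption; the key name below is illustrative, so check the current config reference), enabling fp8 might look like:

```yaml
trt_llm:
  build:
    quantization_type: fp8  # assumed key name; verify in your Truss version
```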
The team at Baseten also offers options for sharing cached model weights across replicas, enabling very fast horizontal scaling. Please contact us to enable this option.
We support deployment of the models below, as well as all fine-tuned variants of these models (same architecture, customized weights). The following repositories are supported; this list is not exhaustive.
| Model Repository | Architecture | Function |
|---|---|---|
| Salesforce/SFR-Embedding-Mistral | MistralModel | embedding |
| BAAI/bge-m3 | BertModel | embedding |
| BAAI/bge-multilingual-gemma2 | Gemma2Model | embedding |
| mixedbread-ai/mxbai-embed-large-v1 | BertModel | embedding |
| BAAI/bge-large-en-v1.5 | BertModel | embedding |
| allenai/Llama-3.1-Tulu-3-8B-RM | LlamaForSequenceClassification | classifier |
| ncbi/MedCPT-Cross-Encoder | BertForSequenceClassification | reranker/classifier |
| SamLowe/roberta-base-go_emotions | RobertaForSequenceClassification | classifier |
| mixedbread/mxbai-rerank-large-v2-seq | Qwen2ForSequenceClassification | reranker/classifier |
| BAAI/bge-en-icl | LlamaModel | embedding |
| BAAI/bge-reranker-v2-m3 | BertForSequenceClassification | reranker/classifier |
| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | LlamaForSequenceClassification | classifier |
| Snowflake/snowflake-arctic-embed-l | BertModel | embedding |
| nomic-ai/nomic-embed-code | Qwen2Model | embedding |
¹ Measured on H100-HBM3 (bert-large, 335M parameters; for BAAI/bge-en-icl: 9 ms).
² Measured on H100-HBM3 (leading model architecture on MTEB, MistralModel-7B).