Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten
Install the OpenAI SDK
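Once both steps are done, a quick preflight check can confirm the environment is ready. The following is a minimal sketch, not part of the Baseten tooling: the helper name `missing_prerequisites` is hypothetical, and it only checks that the `openai` package is importable and that the `BASETEN_API_KEY` environment variable (used later when calling the model) is set. It does not verify the Truss sign-in.

```python
import importlib.util
import os


def missing_prerequisites(env=None) -> list[str]:
    """Return a list of setup steps that still appear incomplete."""
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when the package cannot be located.
    if importlib.util.find_spec("openai") is None:
        problems.append("openai package is not installed")
    # The later call step expects the API key in this variable.
    if not env.get("BASETEN_API_KEY"):
        problems.append("BASETEN_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

An empty result means both checks passed.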
- Cost
- Latency
This preset serves Qwen3 Embedding 8B on an H100 40GB through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for batch embedding cost.

- Hardware: H100_40GB
- Engine: TRT-LLM

Write the config
Create and move into the project directory:

Then create a file named config.yaml and paste the following:
config.yaml

This config tells Baseten to build a BEI engine for Qwen3 Embedding 8B on an H100 40GB, drawing FP8 weights from michaelfeil/Qwen3-Embedding-8B-auto, a mirror of the official model with an architecture string compatible with BEI’s encoder build path. FP8 quantization keeps the per-request cost low, which makes this preset a good default for offline indexing and large RAG ingest pipelines.

Key parameters
BEI reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

| Parameter | Value |
|---|---|
| Quantization | fp8 |
| Base model type | encoder |

Deploy
Push the config to Baseten:

You should see output similar to:

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model
Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.

Now call your deployment to generate embeddings:
- Python
- cURL
main.py
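The call can be sketched with the OpenAI SDK as follows. This is a minimal example under stated assumptions, not the definitive client code: the base URL pattern `https://model-{model_id}.api.baseten.co/environments/production/sync/v1` is assumed (copy the exact URL from your model's page if it differs), the `model` argument is a placeholder that the deployment is assumed to ignore, and the helper names `build_base_url` and `embed` are illustrative.

```python
import os


def build_base_url(model_id: str) -> str:
    """Return the assumed OpenAI-compatible base URL for a deployment."""
    # Assumption: production environment, sync route. Verify in your dashboard.
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def embed(texts: list[str], model_id: str) -> list[list[float]]:
    """Send texts to the embeddings route and return one vector per input."""
    # Imported lazily so build_base_url works even without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=build_base_url(model_id),
    )
    response = client.embeddings.create(
        model="qwen3-embedding-8b",  # placeholder name, assumed to be ignored
        input=texts,
    )
    return [item.embedding for item in response.data]


if __name__ == "__main__":
    if "BASETEN_API_KEY" in os.environ:
        vectors = embed(
            ["Baseten Embeddings Inference", "Qwen3 Embedding 8B"],
            model_id="abcd1234",  # replace with your own model ID
        )
        print(len(vectors), "vectors of dimension", len(vectors[0]))
    else:
        print("Set BASETEN_API_KEY before calling the deployment.")
```

Passing a list of strings in one request lets the server batch them, which suits the offline indexing workloads this preset targets.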