Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Qwen/Qwen3-Embedding-8B is an 8B-parameter dense model.
This variant ships in two presets tuned for different goals: Cost for the lowest per-request cost, and Latency for the lowest time-to-first-token. Pick the preset that matches your workload.
This preset serves Qwen3 Embedding 8B on H100 40GB through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for batch embedding cost.
Write the config
Create and move into the project directory:
mkdir qwen3-embedding-8b-cost && cd qwen3-embedding-8b-cost
Then create a file named config.yaml and paste the following:
model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-8b
model_name: "model:qwen3-embedding-8b preset:cost"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-8B-auto
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /v1/embeddings
This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Embedding 8B on an H100 40GB, drawing FP8 weights from michaelfeil/Qwen3-Embedding-8B-auto, a mirror of the official model with an architecture string compatible with BEI’s encoder build path. FP8 quantization keeps the per-request cost low, which makes this preset a good default for offline indexing and large RAG ingest pipelines.
Key parameters
Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

| Parameter | Value |
|---|---|
| Quantization | fp8 |
| Base model type | encoder |
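To make the quantization choice concrete, here is a rough back-of-the-envelope sketch of how much accelerator memory the weights alone would occupy at different precisions. It assumes exactly 8 billion parameters (the page only says "8B") and ignores activations, engine workspace, and any runtime overhead, so treat the numbers as an order-of-magnitude guide rather than a measurement.

```python
# Rough weight-memory estimate. Assumptions: exactly 8e9 parameters, and
# only the weights are counted (no activations, workspace, or overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gib(num_params: float, quantization: str) -> float:
    """Approximate size of the model weights in GiB at a given precision."""
    return num_params * BYTES_PER_PARAM[quantization] / 2**30


params = 8e9  # "8B-parameter" taken literally, for illustration only
for precision in ("fp16", "fp8"):
    print(f"{precision}: ~{weight_gib(params, precision):.1f} GiB of weights")
```

Under these assumptions, FP8 halves the weight footprint relative to 16-bit weights, which leaves more of the 40 GB accelerator free for batching.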
Deploy
Push the config to Baseten. You should see output similar to:
✨ Model qwen3-embedding-8b-cost was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
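If you prefer not to copy the model ID by hand, it can be pulled out of the logs URL programmatically. The helper below is a minimal sketch: model_id_from_logs_url is a hypothetical name, and it assumes the URL has the same shape as the example output, with the ID directly after the /models/ path segment.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Assumed path shape: models/<model_id>/logs/<another id>
    idx = parts.index("models")  # raises ValueError if the segment is absent
    return parts[idx + 1]


# Example URL copied from the sample output above.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
print(model_id)
print(base_url)
```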
Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to generate embeddings:
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=[
        "Baseten is a fast inference provider.",
        "Embeddings power semantic search and RAG.",
    ],
)

# Print each vector's dimension and its first four values.
for item in response.data:
    print(len(item.embedding), item.embedding[:4])
The same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": [
      "Baseten is a fast inference provider.",
      "Embeddings power semantic search and RAG."
    ]
  }'
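Once you have embeddings back, a common next step is ranking documents against a query by cosine similarity. The sketch below uses small hand-written toy vectors in place of real embeddings so it runs without a deployment. cosine_similarity and rank_documents are illustrative helper names, not part of any Baseten or OpenAI API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for item.embedding values.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
print(ranking)
```

With real responses, you would pass item.embedding for each entry in response.data instead of the toy vectors.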
For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.

This preset serves Qwen3 Embedding 8B on a single B200 through Baseten Embeddings Inference (BEI) with FP4 weights, optimized for the lowest single-request embedding latency.
Write the config
Create and move into the project directory:
mkdir qwen3-embedding-8b-latency && cd qwen3-embedding-8b-latency
Then create a file named config.yaml and paste the following:
model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-8b
model_name: "model:qwen3-embedding-8b preset:latency"
python_version: py39
resources:
  accelerator: B200
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-8B-auto
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp4
  runtime:
    webserver_default_route: /v1/embeddings
This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Embedding 8B on a single B200, drawing FP4 weights from michaelfeil/Qwen3-Embedding-8B-auto. The B200’s native FP4 tensor cores keep single-request embedding latency low, which is the right trade for interactive RAG and real-time semantic search.
Key parameters
Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

| Parameter | Value |
|---|---|
| Quantization | fp4 |
| Base model type | encoder |
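The Cost and Latency configs on this page are nearly identical, so it can help to see exactly which fields change. The sketch below hand-copies a subset of both configs into Python dictionaries and diffs them. flatten and diff_configs are illustrative helpers, and the dictionaries are transcriptions of the YAML on this page, not something read from Baseten.

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} form."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


def diff_configs(a, b):
    """Return {path: (value_in_a, value_in_b)} for every differing field."""
    fa, fb = flatten(a), flatten(b)
    return {
        k: (fa.get(k), fb.get(k))
        for k in sorted(fa.keys() | fb.keys())
        if fa.get(k) != fb.get(k)
    }


def preset(name, accelerator, quantization):
    """Build the subset of config fields being compared (hand-transcribed)."""
    return {
        "model_name": f"model:qwen3-embedding-8b preset:{name}",
        "resources": {"accelerator": accelerator, "cpu": "1", "memory": "10Gi"},
        "trt_llm": {"build": {
            "base_model": "encoder",
            "max_num_tokens": 40960,
            "quantization_type": quantization,
        }},
    }


cost = preset("cost", "H100_40GB", "fp8")
latency = preset("latency", "B200", "fp4")
for path, (a, b) in diff_configs(cost, latency).items():
    print(f"{path}: {a} -> {b}")
```

Only the model name, the accelerator, and the quantization type differ between the two presets.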
Deploy
Push the config to Baseten. You should see output similar to:
✨ Model qwen3-embedding-8b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to generate embeddings:
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=[
        "Baseten is a fast inference provider.",
        "Embeddings power semantic search and RAG.",
    ],
)

# Print each vector's dimension and its first four values.
for item in response.data:
    print(len(item.embedding), item.embedding[:4])
The same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": [
      "Baseten is a fast inference provider.",
      "Embeddings power semantic search and RAG."
    ]
  }'
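Because this preset targets single-request latency, you may want to time a few requests from your own client to see what you actually get. The sketch below is one simple way to do that. measure_latency is a hypothetical helper, and fake_embed is a stand-in that just sleeps, so the example runs without a deployment. In practice you would pass a function that performs the client.embeddings.create call shown above.

```python
import statistics
import time


def measure_latency(call, runs=5):
    """Time `call()` several times and summarize wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


def fake_embed():
    """Stand-in for a real embeddings request; sleeps for about 10 ms."""
    time.sleep(0.01)


stats = measure_latency(fake_embed, runs=5)
print(stats)
```

Client-side timings include network round trips, so they are an upper bound on the time spent in the model itself.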
For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.
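To illustrate what request batching means in general, the sketch below splits a large list of texts into fixed-size batches and collects the results in order. This is not the Performance Client's API. batched and embed_all are hypothetical helpers, the batch size of 32 is arbitrary, and fake_embed_batch stands in for a function that would send one batch to your deployment.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed every text, one batch per call, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real request: returns one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(70)]
vectors = embed_all(texts, fake_embed_batch, batch_size=32)
print(len(vectors))
```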