
Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Qwen/Qwen3-Embedding-8B is an 8B-parameter dense model.

Hardware

H100

Engine

TRT-LLM

Write the config

Create and move into the project directory:
mkdir qwen3-embedding-8b && cd qwen3-embedding-8b
Then create a file named config.yaml and paste the following:
config.yaml
model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-8b
model_name: "model:qwen3-embedding-8b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-8B-auto
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /v1/embeddings

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter        Value
Quantization     fp8
Base model type  encoder
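To see why the quantization setting matters, here is a rough back-of-the-envelope estimate of weight memory. It is a sketch under stated assumptions: the parameter count is rounded to 8 billion, only weights are counted (activations and runtime overhead are ignored), and the fp16 row is a hypothetical 16-bit baseline for comparison, not a setting taken from the config above.

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Assumptions: ~8e9 parameters, weights only, no runtime overhead.
PARAMS = 8e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gib(params: float, dtype: str) -> float:
    """Return the approximate weight footprint in GiB."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("fp16", "fp8"):
    print(f"{dtype}: ~{weight_gib(PARAMS, dtype):.1f} GiB")
```

Under these assumptions, fp8 weights take about half the memory of a 16-bit baseline. Actual usage on the GPU will be higher once the engine's runtime buffers are included.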

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-embedding-8b was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.
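If you prefer not to edit the URL by hand, a small helper can build it from the model ID. This is a minimal sketch: build_base_url is a hypothetical name, and the URL pattern is copied from the production endpoint used in this guide.

```python
def build_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a deployed model.

    build_base_url is a hypothetical helper; the URL pattern mirrors the
    production endpoint shown in this guide.
    """
    # Catch an empty ID or a placeholder that was never filled in.
    if not model_id or "{" in model_id or "}" in model_id:
        raise ValueError("Pass the real model ID from the truss push output.")
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1"
    )


# Example with the model ID from the sample output above.
print(build_base_url("abc1d2ef"))
```

The placeholder check exists because a literal {model_id} left in the URL would otherwise only surface as a failed request.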

Call the model

Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. In the script below, replace {model_id} in base_url with your model ID and make sure the BASETEN_API_KEY environment variable is set, then run it to generate embeddings:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=[
        "Baseten is a fast inference provider.",
        "Embeddings power semantic search and RAG.",
    ],
)

for item in response.data:
    print(len(item.embedding), item.embedding[:4])
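Once you have the vectors, a common next step for semantic search is to compare them with cosine similarity. The sketch below uses only the standard library and short toy vectors in place of real embeddings; cosine_similarity is an illustrative helper, not part of the OpenAI SDK or Baseten.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for item.embedding values.
query = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 1.1]
doc_far = [0.0, 1.0, 0.0]

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

With real output, you would pass response.data[0].embedding and response.data[1].embedding in place of the toy vectors. Higher scores indicate texts the model considers more similar.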
For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.
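To illustrate what batching means here, the following sketch splits a large list of texts into fixed-size chunks and sends one request per chunk through the OpenAI SDK client shown earlier. It is a manual approximation and does not use the Performance Client's API; chunked, embed_all, and the batch size of 32 are illustrative assumptions.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most size elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(client, texts: List[str], batch_size: int = 32) -> List[List[float]]:
    """Embed texts in batches, preserving input order.

    client is an OpenAI SDK client configured as in main.py.
    batch_size is an arbitrary illustrative value, not a tuned setting.
    """
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(
            model="qwen3-embedding-8b",
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

This version sends batches one after another, so it shows the batching idea but not the pipelining that the Performance Client adds on top.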