Qwen3 Embedding

Setup

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. Each of the three recipes below (0.6B, 4B, and 8B) is self-contained.

Qwen/Qwen3-Embedding-0.6B is a 0.6B-parameter dense model. This preset serves Qwen3 Embedding 0.6B on a single L4 through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for embedding throughput on low-cost hardware.

Hardware

L4

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-embedding-0.6b-throughput && cd qwen3-embedding-0.6b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-0.6b
model_name: "model:qwen3-embedding-0.6b preset:throughput"
python_version: py39
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-0.6B-auto
      revision: main
      source: HF
    max_num_tokens: 32768
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /v1/embeddings

This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Embedding 0.6B on a single L4, drawing FP8 weights from michaelfeil/Qwen3-Embedding-0.6B-auto, a mirror of the official checkpoint with an architecture string compatible with BEI’s encoder build path. FP8 quantization on an L4 keeps per-embedding cost low while dynamic batching sustains high throughput.

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-embedding-0.6b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). In the examples below, replace {model_id} with that ID. The examples read your API key from the BASETEN_API_KEY environment variable.
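
To avoid editing the URL by hand, the substitution can be wrapped in a small helper. This is a sketch only: build_base_url and read_api_key are hypothetical names, and treating the environment segment of the URL as a parameter is an assumption based on the production URL used on this page.

```python
import os


def build_base_url(model_id, environment="production"):
    """Return the OpenAI-compatible base URL for a deployed model.

    The environment parameter is an assumption; this page only shows "production".
    """
    return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/v1"


def read_api_key(env=os.environ):
    """Fetch the API key, failing with a clear message when it is missing."""
    key = env.get("BASETEN_API_KEY")
    if not key:
        raise RuntimeError("Set BASETEN_API_KEY before calling the deployment")
    return key


# abc1d2ef is the placeholder model ID from the sample output above.
print(build_base_url("abc1d2ef"))
```

The returned string can be passed as base_url when constructing the OpenAI client.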

Call the model

Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Now call your deployment to generate embeddings:

Python (main.py) first, then the equivalent cURL request:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-0.6b",
    input=[
        "Baseten is a fast inference provider",
        "Embeddings let you do semantic search.",
    ],
)

for item in response.data:
    print(len(item.embedding), item.embedding[:4])

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "qwen3-embedding-0.6b",
    "input": [
      "Baseten is a fast inference provider",
      "Embeddings let you do semantic search."
    ]
  }'
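
The cURL request returns raw JSON. The sketch below assumes the body follows the OpenAI embeddings response shape (a data list whose items carry index and embedding), which is what an OpenAI-compatible endpoint is expected to return. The payload values are made up and real vectors are much longer.

```python
import json

# Made-up payload in the OpenAI embeddings response shape.
sample_body = """
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
    {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}
  ],
  "model": "qwen3-embedding-0.6b"
}
"""


def extract_vectors(body):
    """Return embeddings ordered by index so they line up with the inputs."""
    payload = json.loads(body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = extract_vectors(sample_body)
print(len(vectors), vectors[0])
```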

For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.
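
To illustrate the idea of batching only, here is a minimal client-side sketch. It does not use the Performance Client's API: embed_batch is a hypothetical callable that takes a list of strings and returns one vector per string, and the batch size of 32 is an arbitrary assumption.

```python
def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def embed_all(texts, embed_batch, batch_size=32):
    """Embed texts batch by batch using a caller-supplied embed_batch callable."""
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch):
    """Stand-in for a real request: maps each text to a one-element vector."""
    return [[float(len(text))] for text in batch]


print(embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

In a real script, embed_batch would wrap a call such as client.embeddings.create and return the embedding of each item in response.data.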

Qwen/Qwen3-Embedding-4B is a 4B-parameter dense model. This preset serves Qwen3 Embedding 4B on a single H100 through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for embedding throughput.

Hardware

H100

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-embedding-4b-throughput && cd qwen3-embedding-4b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-4b
model_name: "model:qwen3-embedding-4b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-4B-auto
      revision: main
      source: HF
    max_num_tokens: 32768
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /v1/embeddings

This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Embedding 4B on a single H100, drawing FP8 weights from michaelfeil/Qwen3-Embedding-4B-auto, a mirror of the official checkpoint with an architecture string compatible with BEI’s encoder build path. FP8 quantization and dynamic batching keep throughput high for indexing and RAG ingest workloads.
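
For large ingest jobs it can help to keep each request within a token budget. The sketch below assumes max_num_tokens (32768 in this config) acts as a cap on the tokens processed in one batch, and it takes a caller-supplied count_tokens function. The word-count stand-in used in the demo is not the model's tokenizer, and pack_by_token_budget is a hypothetical helper.

```python
def pack_by_token_budget(texts, count_tokens, max_num_tokens=32768):
    """Greedily group texts so each group's estimated token total fits the budget."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = count_tokens(text)
        if cost > max_num_tokens:
            raise ValueError("a single input exceeds the token budget")
        # Start a new group when adding this text would overflow the budget.
        if current and used + cost > max_num_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


def word_count(text):
    """Stand-in counter: whitespace-separated words, NOT the model's tokenizer."""
    return len(text.split())


docs = ["one two three", "four five", "six seven eight nine"]
print(pack_by_token_budget(docs, word_count, max_num_tokens=5))
```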

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-embedding-4b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). In the examples below, replace {model_id} with that ID. The examples read your API key from the BASETEN_API_KEY environment variable.
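
If you script the deployment, the model ID can be pulled out of the CLI output. This is a sketch that assumes the output keeps the "Label: value" layout of the sample above. The exact format may differ between Truss versions, and parse_push_output is a hypothetical helper.

```python
import re

# Labeled lines copied from the sample output above.
sample_output = """\
   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
"""


def parse_push_output(text):
    """Collect 'Label: value' pairs from the push summary into a dictionary."""
    fields = {}
    pattern = r"^\s*([A-Za-z ]+):\s+(\S+)\s*$"
    for match in re.finditer(pattern, text, flags=re.MULTILINE):
        fields[match.group(1).strip()] = match.group(2)
    return fields


summary = parse_push_output(sample_output)
print(summary["Model ID"])
```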

Call the model

Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Now call your deployment to generate embeddings:

Python (main.py) first, then the equivalent cURL request:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-4b",
    input=[
        "Baseten is a fast inference provider",
        "Embeddings let you do semantic search.",
    ],
)

for item in response.data:
    print(len(item.embedding), item.embedding[:4])

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "qwen3-embedding-4b",
    "input": [
      "Baseten is a fast inference provider",
      "Embeddings let you do semantic search."
    ]
  }'
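
The cURL request above can also be assembled with the Python standard library when the OpenAI SDK is not available. The sketch below only builds the request object and does not send it. build_embeddings_request is a hypothetical helper, and the model ID and API key are placeholders.

```python
import json
import urllib.request


def build_embeddings_request(model_id, api_key, texts, model="qwen3-embedding-4b"):
    """Assemble (but do not send) the POST request shown in the cURL example."""
    url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_embeddings_request(
    "abc1d2ef", "placeholder-key", ["Baseten is a fast inference provider"]
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen and parse the JSON body.
```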

For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.

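Qwen/Qwen3-Embedding-8B is an 8B-parameter dense model. This preset serves Qwen3 Embedding 8B on a single H100 through Baseten Embeddings Inference (BEI) with FP8 weights.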

Hardware

H100

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-embedding-8b && cd qwen3-embedding-8b

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    input:
      - Baseten is a fast inference provider
      - Embeddings let you do semantic search.
    model: qwen3-embedding-8b
model_name: "model:qwen3-embedding-8b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Embedding-8B-auto
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /v1/embeddings
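
The three configs on this page share most of their fields. As a sketch, the table below collects a few of the values that vary between them and compares two presets at a time. PRESETS and diff_presets are illustrative names, not part of Truss.

```python
# Values copied from the three config.yaml files on this page.
PRESETS = {
    "0.6b": {
        "accelerator": "L4",
        "repo": "michaelfeil/Qwen3-Embedding-0.6B-auto",
        "max_num_tokens": 32768,
    },
    "4b": {
        "accelerator": "H100",
        "repo": "michaelfeil/Qwen3-Embedding-4B-auto",
        "max_num_tokens": 32768,
    },
    "8b": {
        "accelerator": "H100",
        "repo": "michaelfeil/Qwen3-Embedding-8B-auto",
        "max_num_tokens": 40960,
    },
}


def diff_presets(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}


# The 4B and 8B recipes share the accelerator but differ in repo and token limit.
for key, (old, new) in diff_presets(PRESETS["4b"], PRESETS["8b"]).items():
    print(f"{key}: {old} -> {new}")
```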

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-embedding-8b was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). In the examples below, replace {model_id} with that ID. The examples read your API key from the BASETEN_API_KEY environment variable.

Call the model

Your deployment serves an OpenAI-compatible embeddings API at /v1/embeddings. Now call your deployment to generate embeddings:

Python (main.py) first, then the equivalent cURL request:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=[
        "Baseten is a fast inference provider",
        "Embeddings let you do semantic search.",
    ],
)

for item in response.data:
    print(len(item.embedding), item.embedding[:4])

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": [
      "Baseten is a fast inference provider",
      "Embeddings let you do semantic search."
    ]
  }'

For higher throughput, use the Baseten Performance Client, which batches and pipelines requests automatically.
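
Once you have embeddings, semantic search comes down to comparing vectors. The sketch below ranks documents by cosine similarity to a query. The toy 3-dimensional vectors stand in for real embeddings, and rank_documents is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vector, vector) for vector in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for item.embedding values from the response.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))
```

In practice the query and the documents would each be embedded through the deployment, with the document vectors computed once and stored.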

Next steps

Call your model

Autoscaling