Qwen3 Reranker

Setup

Sign in to Baseten

uvx truss login --browser

Install requests

uv pip install requests

Pick the model you want to deploy. Each tab is a self-contained recipe.

0.6B
4B
8B

Qwen/Qwen3-Reranker-0.6B is a 0.6B-parameter dense model. This preset serves Qwen3 Reranker 0.6B on a single L4 through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for reranking throughput on low-cost hardware.

Hardware

L4

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-reranker-0.6b-throughput && cd qwen3-reranker-0.6b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-0.6b preset:throughput"
python_version: py39
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-0.6B-seq
      revision: main
      source: HF
    max_num_tokens: 32768
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict

This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Reranker 0.6B on a single L4, drawing FP8 weights from michaelfeil/Qwen3-Reranker-0.6B-seq, a sequence-classification conversion of the official checkpoint compatible with BEI’s encoder build path. The deployment scores query-passage pairs on the /predict route with dynamic batching keeping throughput high.

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`
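
As a rough illustration of what the `fp8` setting means for memory, the sketch below estimates weight size from parameter count and bits per weight. The helper name `weight_gib` is hypothetical, the 0.6B figure is taken from the model name, and the estimate covers weights only: it ignores activations, engine workspace, and any layers that may stay in higher precision.

```python
def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GiB for a dense model (weights only)."""
    return num_params * bits_per_weight / 8 / 2**30


params = 0.6e9  # nominal parameter count from the model name
fp16 = weight_gib(params, 16)
fp8 = weight_gib(params, 8)

# FP8 stores one byte per weight, half of FP16's two bytes.
print(f"fp16: {fp16:.2f} GiB, fp8: {fp8:.2f} GiB")  # fp16: 1.12 GiB, fp8: 0.56 GiB
```

Halving the bytes per weight leaves more of the GPU's memory for batching, which is consistent with this preset's focus on throughput.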

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-reranker-0.6b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). The examples below use it wherever you see {model_id}, and read your API key from the BASETEN_API_KEY environment variable.
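
To make the `{model_id}` substitution concrete, here is a minimal sketch that builds the request URL and headers in one place. The helper names `predict_url` and `auth_headers` are hypothetical, and the `environment` parameter is an assumption: this guide only shows the `production` environment.

```python
import os


def predict_url(model_id: str, environment: str = "production") -> str:
    """Build the sync /predict URL used by the examples in this guide."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        f"/environments/{environment}/sync/predict"
    )


def auth_headers() -> dict:
    """Read the API key from BASETEN_API_KEY and fail early if it is missing."""
    api_key = os.environ.get("BASETEN_API_KEY")
    if not api_key:
        raise RuntimeError("Set BASETEN_API_KEY before calling the deployment")
    return {"Authorization": f"Bearer {api_key}"}


# abc1d2ef is the placeholder model ID from the sample truss push output.
print(predict_url("abc1d2ef"))
```

Failing early on a missing key gives a clearer error than the `KeyError` raised by indexing `os.environ` directly.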

Call the model

Your deployment exposes a cross-encoder scoring endpoint at /predict. Now call your deployment to score candidates:

Python

main.py

import os
import requests

# Replace {model_id} with the model ID printed by truss push.
response = requests.post(
    "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
    headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
    json={
        "query": "fast inference platform",
        "texts": [
            "Baseten serves models on dedicated GPUs.",
            "The Eiffel Tower is in Paris.",
            "Cold-start latency matters for autoscaling.",
        ],
    },
)
response.raise_for_status()  # surface HTTP errors before parsing the body

# Print each candidate's score next to its text.
for hit in response.json():
    print(hit["score"], hit["text"])

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "query": "fast inference platform",
    "texts": [
      "Baseten serves models on dedicated GPUs.",
      "The Eiffel Tower is in Paris.",
      "Cold-start latency matters for autoscaling."
    ]
  }'

For batch scoring at higher throughput, use the Baseten Performance Client.
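
Once scores come back, a common next step is to keep only the best candidates. The sketch below assumes the response is a list of objects with `score` and `text` fields, as the loop in main.py implies; the helper name `top_k` is hypothetical and the scores in `sample` are made up for illustration, not real model output.

```python
def top_k(hits: list[dict], k: int = 2) -> list[dict]:
    """Return the k highest-scoring hits, best first."""
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)[:k]


# Made-up scores standing in for a parsed response body.
sample = [
    {"score": 0.91, "text": "Baseten serves models on dedicated GPUs."},
    {"score": 0.03, "text": "The Eiffel Tower is in Paris."},
    {"score": 0.47, "text": "Cold-start latency matters for autoscaling."},
]

for hit in top_k(sample, k=2):
    print(f"{hit['score']:.2f}  {hit['text']}")
```

Sorting on the client keeps the helper independent of whatever order the endpoint returns.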

Qwen/Qwen3-Reranker-4B is a 4B-parameter dense model. This preset serves Qwen3 Reranker 4B on a single H100 through Baseten Embeddings Inference (BEI) with FP8 weights, optimized for reranking throughput.

Hardware

H100

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-reranker-4b-throughput && cd qwen3-reranker-4b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-4b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-4B-seq
      revision: main
      source: HF
    max_num_tokens: 32768
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict

This config tells Baseten to build a BEI (Baseten Embeddings Inference) engine for Qwen3 Reranker 4B on a single H100, drawing FP8 weights from michaelfeil/Qwen3-Reranker-4B-seq, a sequence-classification conversion of the official checkpoint compatible with BEI’s encoder build path. The deployment scores query-passage pairs on the /predict route with dynamic batching keeping throughput high.

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-reranker-4b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). The examples below use it wherever you see {model_id}, and read your API key from the BASETEN_API_KEY environment variable.

Call the model

Your deployment exposes a cross-encoder scoring endpoint at /predict. Now call your deployment to score candidates:

Python

main.py

import os
import requests

# Replace {model_id} with the model ID printed by truss push.
response = requests.post(
    "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
    headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
    json={
        "query": "fast inference platform",
        "texts": [
            "Baseten serves models on dedicated GPUs.",
            "The Eiffel Tower is in Paris.",
            "Cold-start latency matters for autoscaling.",
        ],
    },
)
response.raise_for_status()  # surface HTTP errors before parsing the body

# Print each candidate's score next to its text.
for hit in response.json():
    print(hit["score"], hit["text"])

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "query": "fast inference platform",
    "texts": [
      "Baseten serves models on dedicated GPUs.",
      "The Eiffel Tower is in Paris.",
      "Cold-start latency matters for autoscaling."
    ]
  }'

For batch scoring at higher throughput, use the Baseten Performance Client.
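
If you want to see the shape of client-side batching before adopting a dedicated client, the sketch below splits a long candidate list into chunks and scores them concurrently with the standard library. This is not the Baseten Performance Client: `score_in_batches`, `chunked`, and the `score_fn` callback are hypothetical names, and `fake_score` stands in for a function that would POST one chunk to /predict and return one score per text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def chunked(items: Sequence[str], size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def score_in_batches(
    query: str,
    texts: Sequence[str],
    score_fn: Callable[[str, list[str]], list[float]],
    batch_size: int = 32,
    max_workers: int = 4,
) -> list[float]:
    """Score texts in concurrent chunks, preserving the input order."""
    batches = chunked(texts, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so scores line up with texts.
        per_batch = list(pool.map(lambda batch: score_fn(query, batch), batches))
    return [score for batch in per_batch for score in batch]


def fake_score(query: str, batch: list[str]) -> list[float]:
    """Placeholder scorer: returns text length instead of a model score."""
    return [float(len(text)) for text in batch]


texts = [f"passage {i}" for i in range(10)]
scores = score_in_batches("fast inference platform", texts, fake_score, batch_size=3)
print(len(scores))  # 10
```

Threads suit this workload because each call spends most of its time waiting on the network rather than computing.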

Qwen/Qwen3-Reranker-8B is an 8B-parameter dense model.

Hardware

H100

Engine

TRT-LLM

Write the config

Create and move into the project directory:

mkdir qwen3-reranker-8b && cd qwen3-reranker-8b

Then create a file named config.yaml and paste the following:

config.yaml

# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-8b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-8B-seq
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Quantization	`fp8`
Base model type	`encoder`

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-reranker-8b was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). The examples below use it wherever you see {model_id}, and read your API key from the BASETEN_API_KEY environment variable.

Call the model

Your deployment exposes a cross-encoder scoring endpoint at /predict. Now call your deployment to score candidates:

Python

main.py

import os
import requests

# Replace {model_id} with the model ID printed by truss push.
response = requests.post(
    "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
    headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
    json={
        "query": "fast inference platform",
        "texts": [
            "Baseten serves models on dedicated GPUs.",
            "The Eiffel Tower is in Paris.",
            "Cold-start latency matters for autoscaling.",
        ],
    },
)
response.raise_for_status()  # surface HTTP errors before parsing the body

# Print each candidate's score next to its text.
for hit in response.json():
    print(hit["score"], hit["text"])

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "query": "fast inference platform",
    "texts": [
      "Baseten serves models on dedicated GPUs.",
      "The Eiffel Tower is in Paris.",
      "Cold-start latency matters for autoscaling."
    ]
  }'

For batch scoring at higher throughput, use the Baseten Performance Client.
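
Calls to a deployment may occasionally fail for transient reasons, for example while a replica is still starting. The sketch below shows one way to wrap a call in retries with exponential backoff using only the standard library. The helper `with_retries` and its parameters are hypothetical, and `flaky` simulates a call that fails twice; with `requests`, you would likely pass `retry_on=(requests.RequestException,)` and wrap the `requests.post(...)` call in a function.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


# Simulated call that fails twice, then succeeds.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays: list[float] = []
result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [0.5, 1.0]
```

Injecting `sleep` keeps the helper easy to test without waiting for real delays.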

Next steps

Call your model

Endpoint anatomy, authentication, and sync versus async inference

Autoscaling

Scale replicas with traffic, including scale to zero
