
Setup

To get started, sign in to Baseten with Truss, then install the Python requests library.
Sign in to Baseten
uvx truss login --browser
Install requests
uv pip install requests
Qwen/Qwen3-Reranker-8B is an 8B-parameter dense model. This variant ships in 2 presets tuned for different goals: Cost for lowest per-request cost, and Latency for lowest time-to-first-token. Pick the tab that matches your workload.
The Cost preset shown here serves Qwen3 Reranker 8B on H100 40GB through Baseten Embeddings Inference (BEI), optimized for batch scoring cost.

Hardware

H100_40GB

Engine

TRT-LLM

Write the config

Create and move into the project directory:
mkdir qwen3-reranker-8b-cost && cd qwen3-reranker-8b-cost
Then create a file named config.yaml and paste the following:
config.yaml
# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-8b preset:cost"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-8B-seq
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter          Value
Quantization       fp8
Base model type    encoder

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-reranker-8b-cost was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
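If you would rather not copy the ID by hand, the rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `model_id_from_logs_url` and `predict_url` are hypothetical and not part of Truss or Baseten, and the URL shape is taken from the example output and the call shown in this guide.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Expected path shape: models/<model_id>/logs/<deployment_id>
    if "models" not in parts or parts.index("models") + 1 >= len(parts):
        raise ValueError(f"no model ID found in {logs_url!r}")
    return parts[parts.index("models") + 1]


def predict_url(model_id: str) -> str:
    """Fill {model_id} into the production /predict URL used in this guide."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/predict"
    )


# The logs URL below is the placeholder from the example output, not a real deployment.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)
print(predict_url(model_id))
```

With the placeholder URL, the first line printed is abcd1234, matching the example above.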

Call the model

Your deployment exposes a cross-encoder scoring endpoint at /predict. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to score candidates:
main.py
import os
import requests

response = requests.post(
    "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
    headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
    json={
        "query": "fast inference platform",
        "texts": [
            "Baseten serves models on dedicated GPUs.",
            "The Eiffel Tower is in Paris.",
            "Cold-start latency matters for autoscaling.",
        ],
    },
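)

# Stop with a clear HTTP error (for example, a wrong API key) before parsing the body.
response.raise_for_status()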

for hit in response.json():
    print(hit["score"], hit["text"])
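The loop above prints hits in whatever order the endpoint returns them. If you want the most relevant candidates first, a minimal sketch is to sort by score on the client. It assumes the response is a list of objects with "score" and "text" keys, as the loop implies; `rank_hits` is a hypothetical helper, and the sample scores are made up for illustration rather than real model output.

```python
def rank_hits(hits, top_k=None):
    """Sort scored candidates from most to least relevant."""
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical response payload with made-up scores, for illustration only.
sample_hits = [
    {"score": 0.12, "text": "The Eiffel Tower is in Paris."},
    {"score": 0.91, "text": "Baseten serves models on dedicated GPUs."},
    {"score": 0.47, "text": "Cold-start latency matters for autoscaling."},
]

for hit in rank_hits(sample_hits, top_k=2):
    print(f"{hit['score']:.2f}  {hit['text']}")
```

In main.py you would pass `response.json()` to `rank_hits` in place of `sample_hits`.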
For batch scoring at higher throughput, use the Baseten Performance Client.