Setup

To get started, sign in to Baseten with Truss, then install the Python requests library.
Sign in to Baseten
uvx truss login --browser
Install requests
uv pip install requests
Qwen/Qwen3-Reranker-8B is an 8B-parameter dense model. This variant ships in two presets tuned for different goals: Cost for lowest per-request cost, and Latency for lowest time-to-first-token. Pick the preset that matches your workload.
The Cost preset serves Qwen3 Reranker 8B on H100 40GB through Baseten Embeddings Inference (BEI), optimized for batch scoring cost.

Hardware: H100_40GB
Engine: TRT-LLM

Write the config

Create and move into the project directory:
mkdir qwen3-reranker-8b-cost && cd qwen3-reranker-8b-cost
Then create a file named config.yaml and paste the following:
config.yaml
# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-8b preset:cost"
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-8B-seq
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict

Key parameters

Baseten Embeddings Inference (BEI) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter          Value
Quantization       fp8
Base model type    encoder
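
The example_model_input block in config.yaml hints at one request shape: a list of single-element lists under inputs, plus raw_scores, truncate, and truncation_direction. The following is a minimal sketch of building a payload with that shape. The helper name build_example_payload is hypothetical, and whether your deployment accepts exactly these fields is an assumption to verify; the call example later in this guide sends query and texts instead.

```python
from typing import Any


def build_example_payload(texts: list[str]) -> dict[str, Any]:
    """Mirror the shape of example_model_input in config.yaml (hypothetical helper)."""
    return {
        # One single-element list per text, matching the nested `inputs` list.
        "inputs": [[text] for text in texts],
        # Remaining fields are copied from the example input in the config.
        "raw_scores": True,
        "truncate": True,
        "truncation_direction": "Right",
    }


payload = build_example_payload(
    ["Baseten is a fast inference provider", "Classify this separately."]
)
print(payload["inputs"])
```

Keeping the payload builder separate from the HTTP call makes it easy to check the request shape without a live deployment.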

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-reranker-8b-cost was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
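
If you script your deployments, the model ID can be pulled out of the logs URL rather than copied by hand. This is a small sketch under stated assumptions: the helper names model_id_from_logs_url and predict_url are hypothetical, the logs URL is assumed to follow the format shown in the sample output above, and the URL pattern is the production one used in this guide's call example.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    return parts[parts.index("models") + 1]


def predict_url(model_id: str) -> str:
    """Fill the model ID into the production predict URL used in this guide."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/predict"
    )


# Sample logs URL taken from the example output above.
logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)  # abcd1234
print(predict_url(model_id))
```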

Call the model

Your deployment exposes a cross-encoder scoring endpoint at /predict. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to score candidates:
main.py
import os
import requests

response = requests.post(
    "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "query": "fast inference platform",
        "texts": [
            "Baseten serves models on dedicated GPUs.",
            "The Eiffel Tower is in Paris.",
            "Cold-start latency matters for autoscaling.",
        ],
    },
)

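# Stop on HTTP errors (for example, a bad API key) before parsing the body.
response.raise_for_status()

# Print the score and candidate text for each returned hit.
for hit in response.json():
    print(hit["score"], hit["text"])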
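
To turn scores into a ranking, sort the hits by score in descending order. The sketch below assumes the response is a list of objects with score and text keys, as the loop above reads them. The rank_hits helper is hypothetical, and the sample scores are made up for illustration rather than taken from a real response.

```python
def rank_hits(hits: list[dict]) -> list[dict]:
    """Sort scored candidates from most to least relevant."""
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


# Hypothetical response body; these scores are illustrative only.
sample_hits = [
    {"score": 0.12, "text": "The Eiffel Tower is in Paris."},
    {"score": 0.91, "text": "Baseten serves models on dedicated GPUs."},
    {"score": 0.47, "text": "Cold-start latency matters for autoscaling."},
]

for hit in rank_hits(sample_hits):
    print(f"{hit['score']:.2f}  {hit['text']}")
```

Sorting on the client keeps the request unchanged, so the same call works whether or not the endpoint returns hits in ranked order.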
For batch scoring at higher throughput, use the Baseten Performance Client.