> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Qwen3 Reranker

> Alibaba's Qwen3 Reranker is an 8B cross-encoder for high-quality passage reranking in retrieval-augmented generation pipelines.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reranking" className="capability-pill">Reranking</a>
  <a href="/examples/models/capabilities/cross-encoder" className="capability-pill">Cross-encoder</a>
</div>

## Setup

To get started, sign into Baseten with Truss and then install the Python `requests` library.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install requests**

    ```sh theme={"system"}
    uv pip install requests
    ```
  </Column>
</Columns>

[Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) is an 8B-parameter dense model.

<CardGroup cols={2}>
  <Card title="Hardware" icon="microchip">H100</Card>
  <Card title="Engine" icon="server">TRT-LLM</Card>
</CardGroup>

## Write the config

Create and move into the project directory:

```sh theme={"system"}
mkdir qwen3-reranker-8b && cd qwen3-reranker-8b
```

Then create a file named `config.yaml` and paste the following:

```yaml config.yaml theme={"system"}
# this file was autogenerated by `generate_templates.py` - please do change via template only
model_metadata:
  example_model_input:
    inputs:
    - - Baseten is a fast inference provider
    - - Classify this separately.
    raw_scores: true
    truncate: true
    truncation_direction: Right
model_name: "model:qwen3-reranker-8b preset:throughput"
python_version: py39
resources:
  accelerator: H100
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    base_model: encoder
    checkpoint_repository:
      repo: michaelfeil/Qwen3-Reranker-8B-seq
      revision: main
      source: HF
    max_num_tokens: 40960
    num_builder_gpus: 1
    quantization_type: fp8
  runtime:
    webserver_default_route: /predict
```

## Key parameters

[Baseten Embeddings Inference](/engines/bei/overview) (BEI) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

| Parameter       | Value     |
| --------------- | --------- |
| Quantization    | `fp8`     |
| Base model type | `encoder` |

## Deploy

Push the config to Baseten:

```sh theme={"system"}
uvx truss push
```

You should see output similar to:

```output theme={"system"}
✨ Model qwen3-reranker-8b was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
```

Your **model ID** is printed in the `truss push` output (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

## Call the model

Your deployment exposes a cross-encoder scoring endpoint at `/predict`. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

Now call your deployment to score candidates:

<Tabs>
  <Tab title="Python">
    ```python main.py theme={"system"}
    import os
    import requests

    response = requests.post(
        "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
        headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
        json={
            "query": "fast inference platform",
            "texts": [
                "Baseten serves models on dedicated GPUs.",
                "The Eiffel Tower is in Paris.",
                "Cold-start latency matters for autoscaling.",
            ],
        },
    )

    for hit in response.json():
        print(hit["score"], hit["text"])
    ```
  </Tab>

  <Tab title="cURL">
    ```sh theme={"system"}
    curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $BASETEN_API_KEY" \
      -d '{
        "query": "fast inference platform",
        "texts": [
          "Baseten serves models on dedicated GPUs.",
          "The Eiffel Tower is in Paris.",
          "Cold-start latency matters for autoscaling."
        ]
      }'
    ```
  </Tab>
</Tabs>

For batch scoring at higher throughput, use the [Baseten Performance Client](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/).
