> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Qwen3 Reranker

> Alibaba's Qwen3 Reranker is an 8B cross-encoder for high-quality passage reranking in retrieval-augmented generation pipelines.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reranking" className="capability-pill">Reranking</a>
  <a href="/examples/models/capabilities/cross-encoder" className="capability-pill">Cross-encoder</a>
</div>

## Setup

To get started, sign into Baseten with Truss and then install the Python `requests` library.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install requests**

    ```sh theme={"system"}
    uv pip install requests
    ```
  </Column>
</Columns>

[Qwen/Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) is an 8B-parameter dense model.

This variant ships in 2 presets tuned for different goals: **Cost** for lowest per-request cost, and **Latency** for lowest time-to-first-token. Pick the tab that matches your workload.

<Tabs>
  <Tab title="Cost">
    This preset serves Qwen3 Reranker 8B on H100 40GB through [Baseten Embeddings Inference](/engines/bei/overview) (BEI), optimized for batch scoring cost.

    <CardGroup cols={2}>
      <Card title="Hardware" icon="microchip">H100\_40GB</Card>
      <Card title="Engine" icon="server">TRT-LLM</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3-reranker-8b-cost && cd qwen3-reranker-8b-cost
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    # this file was autogenerated by `generate_templates.py` - please do change via template only
    model_metadata:
      example_model_input:
        inputs:
        - - Baseten is a fast inference provider
        - - Classify this separately.
        raw_scores: true
        truncate: true
        truncation_direction: Right
    model_name: "model:qwen3-reranker-8b preset:cost"
    python_version: py39
    resources:
      accelerator: H100_40GB
      cpu: '1'
      memory: 10Gi
      use_gpu: true
    trt_llm:
      build:
        base_model: encoder
        checkpoint_repository:
          repo: michaelfeil/Qwen3-Reranker-8B-seq
          revision: main
          source: HF
        max_num_tokens: 40960
        num_builder_gpus: 1
        quantization_type: fp8
      runtime:
        webserver_default_route: /predict
    ```

    ## Key parameters

    [Baseten Embeddings Inference](/engines/bei/overview) (BEI) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

    | Parameter       | Value     |
    | --------------- | --------- |
    | Quantization    | `fp8`     |
    | Base model type | `encoder` |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3-reranker-8b-cost was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment exposes a cross-encoder scoring endpoint at `/predict`. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to score candidates:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        import requests

        response = requests.post(
            "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
            headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
            json={
                "query": "fast inference platform",
                "texts": [
                    "Baseten serves models on dedicated GPUs.",
                    "The Eiffel Tower is in Paris.",
                    "Cold-start latency matters for autoscaling.",
                ],
            },
        )

        for hit in response.json():
            print(hit["score"], hit["text"])
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "query": "fast inference platform",
            "texts": [
              "Baseten serves models on dedicated GPUs.",
              "The Eiffel Tower is in Paris.",
              "Cold-start latency matters for autoscaling."
            ]
          }'
        ```
      </Tab>
    </Tabs>

    For batch scoring at higher throughput, use the [Baseten Performance Client](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/).
  </Tab>

  <Tab title="Latency">
    This preset serves Qwen3 Reranker 8B on B200 through [Baseten Embeddings Inference](/engines/bei/overview) (BEI), optimized for the lowest reranker latency.

    <CardGroup cols={2}>
      <Card title="Hardware" icon="microchip">B200</Card>
      <Card title="Engine" icon="server">TRT-LLM</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3-reranker-8b-latency && cd qwen3-reranker-8b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    # this file was autogenerated by `generate_templates.py` - please do change via template only
    model_metadata:
      example_model_input:
        inputs:
        - - Baseten is a fast inference provider
        - - Classify this separately.
        raw_scores: true
        truncate: true
        truncation_direction: Right
    model_name: "model:qwen3-reranker-8b preset:latency"
    python_version: py39
    resources:
      accelerator: B200
      cpu: '1'
      memory: 10Gi
      use_gpu: true
    trt_llm:
      build:
        base_model: encoder
        checkpoint_repository:
          repo: michaelfeil/Qwen3-Reranker-8B-seq
          revision: main
          source: HF
        max_num_tokens: 40960
        num_builder_gpus: 1
        quantization_type: fp4
      runtime:
        webserver_default_route: /predict
    ```

    ## Key parameters

    [Baseten Embeddings Inference](/engines/bei/overview) (BEI) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

    | Parameter       | Value     |
    | --------------- | --------- |
    | Quantization    | `fp4`     |
    | Base model type | `encoder` |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3-reranker-8b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment exposes a cross-encoder scoring endpoint at `/predict`. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to score candidates:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        import requests

        response = requests.post(
            "https://model-{model_id}.api.baseten.co/environments/production/sync/predict",
            headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
            json={
                "query": "fast inference platform",
                "texts": [
                    "Baseten serves models on dedicated GPUs.",
                    "The Eiffel Tower is in Paris.",
                    "Cold-start latency matters for autoscaling.",
                ],
            },
        )

        for hit in response.json():
            print(hit["score"], hit["text"])
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/predict \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "query": "fast inference platform",
            "texts": [
              "Baseten serves models on dedicated GPUs.",
              "The Eiffel Tower is in Paris.",
              "Cold-start latency matters for autoscaling."
            ]
          }'
        ```
      </Tab>
    </Tabs>

    For batch scoring at higher throughput, use the [Baseten Performance Client](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/).
  </Tab>
</Tabs>
