# Qwen3 Embedding

> Alibaba's Qwen3 Embedding is an 8B text embedding model that maps text into dense vectors for semantic search, retrieval-augmented generation, clustering, and classification.

<div className="capability-pills">
  <a href="/examples/models/capabilities/embedding" className="capability-pill">Embeddings</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) is an 8B-parameter dense model.

This variant ships in two presets tuned for different goals: **Cost** for the lowest per-request cost, and **Latency** for the lowest single-request embedding latency. Pick the tab that matches your workload.

<Tabs>
  <Tab title="Cost">
    This preset serves Qwen3 Embedding 8B on an H100 40GB through [Baseten Embeddings Inference](/engines/bei/overview) (BEI) with FP8 weights, optimized for batch embedding cost.

    <CardGroup cols={2}>
      <Card title="Hardware" icon="microchip">H100\_40GB</Card>
      <Card title="Engine" icon="server">TRT-LLM</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3-embedding-8b-cost && cd qwen3-embedding-8b-cost
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_metadata:
      example_model_input:
        input:
          - Baseten is a fast inference provider
          - Embeddings let you do semantic search.
        model: qwen3-embedding-8b
    model_name: "model:qwen3-embedding-8b preset:cost"
    python_version: py39
    resources:
      accelerator: H100_40GB
      cpu: '1'
      memory: 10Gi
      use_gpu: true
    trt_llm:
      build:
        base_model: encoder
        checkpoint_repository:
          repo: michaelfeil/Qwen3-Embedding-8B-auto
          revision: main
          source: HF
        max_num_tokens: 40960
        num_builder_gpus: 1
        quantization_type: fp8
      runtime:
        webserver_default_route: /v1/embeddings
    ```

    This config tells Baseten to build a BEI engine for Qwen3 Embedding 8B on an H100 40GB with FP8 quantization, using the checkpoint at `michaelfeil/Qwen3-Embedding-8B-auto`, a mirror of the official model whose architecture string is compatible with BEI's encoder build path. FP8 keeps the per-request cost low, which makes this preset a good default for offline indexing and large RAG ingest pipelines.

    ## Key parameters

    [Baseten Embeddings Inference](/engines/bei/overview) (BEI) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

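    | Parameter       | Config field                      | Value     |
    | --------------- | --------------------------------- | --------- |
    | Quantization    | `trt_llm.build.quantization_type` | `fp8`     |
    | Base model type | `trt_llm.build.base_model`        | `encoder` |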

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3-embedding-8b-cost was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
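
    If you script your deployments, the same lookup can be done in code. The following is a minimal sketch that assumes the logs URL always has the shape shown above (`/models/<model_id>/logs/<deployment_id>`); the helper names `model_id_from_logs_url` and `base_url_for` are hypothetical and not part of Truss or the Baseten API.

    ```python
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows /models/ in a logs URL."""
        parts = urlparse(logs_url).path.strip("/").split("/")
        # Assumed path shape: models/<model_id>/logs/<deployment_id>
        if len(parts) < 2 or parts[0] != "models" or not parts[1]:
            raise ValueError(f"no model ID found in {logs_url!r}")
        return parts[1]


    def base_url_for(model_id: str) -> str:
        """Fill the {model_id} placeholder in the production base URL."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)
    print(base_url_for(model_id))
    ```

    The second helper only substitutes the ID into the URL pattern used when calling the model, so a wrong ID still surfaces as a failed request rather than a parsing error.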

    ## Call the model

    Your deployment serves an OpenAI-compatible embeddings API at `/v1/embeddings`. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to generate embeddings:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.embeddings.create(
            model="qwen3-embedding-8b",
            input=[
                "Baseten is a fast inference provider.",
                "Embeddings power semantic search and RAG.",
            ],
        )

        for item in response.data:
            print(len(item.embedding), item.embedding[:4])
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "qwen3-embedding-8b",
            "input": [
              "Baseten is a fast inference provider.",
              "Embeddings power semantic search and RAG."
            ]
          }'
        ```
      </Tab>
    </Tabs>

    For higher throughput, use the [Baseten Performance Client](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/), which batches and pipelines requests automatically.
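
    If you stay on the OpenAI SDK for an offline indexing job, one simple approach is to split the corpus into fixed-size batches and send one request per batch. The sketch below is illustrative only: `batched`, `embed_corpus`, and `make_openai_embedder` are hypothetical helpers, and the default batch size of 64 is an assumption to tune for your own data, not a documented limit.

    ```python
    from typing import Callable, Iterator, List, Sequence

    Vector = List[float]


    def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
        """Yield consecutive slices of `texts` with at most `batch_size` items."""
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        for start in range(0, len(texts), batch_size):
            yield list(texts[start:start + batch_size])


    def embed_corpus(
        texts: Sequence[str],
        embed_batch: Callable[[List[str]], List[Vector]],
        batch_size: int = 64,  # assumption: tune for your workload
    ) -> List[Vector]:
        """Embed every text, one request per batch, preserving input order."""
        vectors: List[Vector] = []
        for batch in batched(texts, batch_size):
            vectors.extend(embed_batch(batch))
        return vectors


    def make_openai_embedder(client, model: str = "qwen3-embedding-8b"):
        """Wrap an OpenAI client so it can be passed to embed_corpus."""
        def embed_batch(batch: List[str]) -> List[Vector]:
            response = client.embeddings.create(model=model, input=batch)
            # Sort by index so vectors line up with the inputs in the batch.
            ordered = sorted(response.data, key=lambda item: item.index)
            return [item.embedding for item in ordered]
        return embed_batch
    ```

    With the `client` from the Python example above, `embed_corpus(documents, make_openai_embedder(client))` returns one vector per document in the original order.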
  </Tab>

  <Tab title="Latency">
    This preset serves Qwen3 Embedding 8B on a single B200 through [Baseten Embeddings Inference](/engines/bei/overview) (BEI) with FP4 weights, optimized for the lowest single-request embedding latency.

    <CardGroup cols={2}>
      <Card title="Hardware" icon="microchip">B200</Card>
      <Card title="Engine" icon="server">TRT-LLM</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3-embedding-8b-latency && cd qwen3-embedding-8b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_metadata:
      example_model_input:
        input:
          - Baseten is a fast inference provider
          - Embeddings let you do semantic search.
        model: qwen3-embedding-8b
    model_name: "model:qwen3-embedding-8b preset:latency"
    python_version: py39
    resources:
      accelerator: B200
      cpu: '1'
      memory: 10Gi
      use_gpu: true
    trt_llm:
      build:
        base_model: encoder
        checkpoint_repository:
          repo: michaelfeil/Qwen3-Embedding-8B-auto
          revision: main
          source: HF
        max_num_tokens: 40960
        num_builder_gpus: 1
        quantization_type: fp4
      runtime:
        webserver_default_route: /v1/embeddings
    ```

    This config tells Baseten to build a BEI engine for Qwen3 Embedding 8B on a single B200 with FP4 quantization, using the checkpoint at `michaelfeil/Qwen3-Embedding-8B-auto`. The B200's native FP4 tensor cores keep single-request embedding latency low, which suits interactive RAG and real-time semantic search.
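
    As rough intuition for why lower precision helps, the arithmetic below estimates the size of the weights alone at different bit widths. It is a back-of-the-envelope sketch, not a measurement: it assumes a nominal 8 billion parameters, assumes every weight is stored at the stated width, and ignores quantization scale factors, activations, and runtime buffers.

    ```python
    def weight_bytes(num_params: float, bits_per_weight: int) -> float:
        """Bytes needed to store the weights alone at a given bit width."""
        return num_params * bits_per_weight / 8


    NOMINAL_PARAMS = 8e9  # assumption: the nominal "8B" parameter count

    for name, bits in [("fp16", 16), ("fp8", 8), ("fp4", 4)]:
        gigabytes = weight_bytes(NOMINAL_PARAMS, bits) / 1e9
        print(f"{name}: ~{gigabytes:.0f} GB of weights")
    ```

    Under these assumptions, halving the bit width halves the bytes that must be read for each forward pass, which is one reason a lower-precision engine can respond faster.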

    ## Key parameters

    [Baseten Embeddings Inference](/engines/bei/overview) (BEI) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

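    | Parameter       | Config field                      | Value     |
    | --------------- | --------------------------------- | --------- |
    | Quantization    | `trt_llm.build.quantization_type` | `fp4`     |
    | Base model type | `trt_llm.build.base_model`        | `encoder` |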

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3-embedding-8b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible embeddings API at `/v1/embeddings`. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to generate embeddings:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.embeddings.create(
            model="qwen3-embedding-8b",
            input=[
                "Baseten is a fast inference provider.",
                "Embeddings power semantic search and RAG.",
            ],
        )

        for item in response.data:
            print(len(item.embedding), item.embedding[:4])
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/embeddings \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "qwen3-embedding-8b",
            "input": [
              "Baseten is a fast inference provider.",
              "Embeddings power semantic search and RAG."
            ]
          }'
        ```
      </Tab>
    </Tabs>

    For higher throughput, use the [Baseten Performance Client](https://www.baseten.co/blog/your-client-code-matters-10x-higher-embedding-throughput-with-python-and-rust/), which batches and pipelines requests automatically.
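
    Because this preset targets single-request latency, it can be useful to measure what your own client observes. The following is a minimal timing sketch: `measure_latency_ms` is a hypothetical helper, the run and warmup counts are arbitrary defaults, and the numbers it reports include network round-trip time, not just time spent on the GPU.

    ```python
    import statistics
    import time
    from typing import Callable, Dict


    def measure_latency_ms(
        call: Callable[[], object],
        runs: int = 20,   # assumption: arbitrary default sample size
        warmup: int = 2,  # assumption: discard the first calls (connection setup)
    ) -> Dict[str, float]:
        """Time repeated calls and return summary statistics in milliseconds."""
        if runs < 1:
            raise ValueError("runs must be at least 1")
        for _ in range(warmup):
            call()
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            call()
            samples.append((time.perf_counter() - start) * 1000.0)
        samples.sort()
        # Nearest-rank style approximation of the 95th percentile.
        p95_index = min(len(samples) - 1, round(0.95 * (len(samples) - 1)))
        return {
            "min": samples[0],
            "median": statistics.median(samples),
            "p95": samples[p95_index],
            "max": samples[-1],
        }
    ```

    With the `client` from the Python example above, `measure_latency_ms(lambda: client.embeddings.create(model="qwen3-embedding-8b", input="What is Baseten?"))` times one short query per run.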
  </Tab>
</Tabs>
