# Qwen3.6

> Qwen3.6 recipes: 2 variants (27B, 35B-A3B), Dense and Hybrid MoE architectures.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/agentic" className="capability-pill">Agentic</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="27B">
    [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) is a 27B-parameter dense model with up to 256K context.

    This preset serves Qwen3.6-27B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and agent workflows.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 4</Card>
      <Card title="Engine" icon="server">vLLM 0.20.0</Card>
      <Card title="Context" icon="ruler-horizontal">256K</Card>
      <Card title="Concurrency" icon="layer-group">64</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.6-27b-latency && cd qwen3.6-27b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.6-27b preset:latency"

    model_metadata:
      example_model_input:
        model: "Qwen/Qwen3.6-27B"
        messages:
          - role: user
            content: "What is the capital of France?"
        stream: true
        max_tokens: 512
        temperature: 1.0
        top_p: 0.95
      tags:
        - openai-compatible

    base_image:
      image: vllm/vllm-openai:v0.20.0

    weights:
      - source: "hf://Qwen/Qwen3.6-27B@main"
        mount_location: "/app/checkpoint/qwen3.6-27b"
        auth_secret_name: "hf_access_token"

    resources:
      accelerator: H100:4
      use_gpu: true

    runtime:
      predict_concurrency: 64

    environment_variables:
      HF_HUB_ENABLE_HF_TRANSFER: "1"
      VLLM_LOGGING_LEVEL: WARNING

    secrets:
      hf_access_token: null

    docker_server:
      start_command: >-
        sh -c "vllm serve /app/checkpoint/qwen3.6-27b
        --served-model-name Qwen/Qwen3.6-27B
        --host 0.0.0.0
        --port 8000
        --trust-remote-code
        --tensor-parallel-size 4
        --max-model-len 262144
        --language-model-only
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --speculative_config.method mtp
        --speculative_config.num_speculative_tokens 2"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                                          | Value         | What it does                                                                                                                                                |
    | --------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--trust-remote-code`                         | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
    | `--tensor-parallel-size`                      | `4`           | Number of GPUs to shard the model across.                                                                                                                   |
    | `--max-model-len`                             | `262144`      | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--language-model-only`                       | (no value)    | Disable the multimodal path; text-only serving. Remove to enable image/video inputs.                                                                        |
    | `--reasoning-parser`                          | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice`                   | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`                          | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--speculative_config.method`                 | `mtp`         | Speculative decoding method. **mtp:** Multi-token prediction head speculation.                                                                              |
    | `--speculative_config.num_speculative_tokens` | `2`           | Number of tokens the draft speculator proposes per step.                                                                                                    |
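
    As a rough guide to what `num_speculative_tokens` buys, the sketch below estimates how many tokens a single verification step yields. It assumes every drafted token is accepted independently with the same probability, which is a simplification. The acceptance rates are hypothetical placeholders, not measurements of this preset.

    ```python
    def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
        """Expected tokens emitted per verification step.

        Assumes each drafted token is accepted independently with probability
        `acceptance_rate`. The step always yields one token from the target model,
        plus every drafted token up to the first rejection.
        """
        if not 0.0 <= acceptance_rate <= 1.0:
            raise ValueError("acceptance_rate must be between 0 and 1")
        return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


    # Hypothetical acceptance rates, with the preset's 2 speculative tokens.
    for rate in (0.5, 0.7, 0.9):
        print(f"acceptance={rate:.1f} -> {expected_tokens_per_step(rate, 2):.2f} tokens/step")
    ```

    Under this model the gain flattens as the draft length grows, so a larger value only helps when the acceptance rate is high.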

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3.6-27b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
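
    If you script the deployment, you can pull the model ID out of the logs URL instead of copying it by hand. This is a minimal sketch that assumes the URL keeps the `/models/<id>/logs/<id>` shape shown above. The helper names `model_id_from_logs_url` and `base_url_for` are illustrative and not part of Truss.

    ```python
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows `models` in a logs URL."""
        segments = [s for s in urlparse(logs_url).path.split("/") if s]
        try:
            return segments[segments.index("models") + 1]
        except (ValueError, IndexError):
            raise ValueError(f"no model ID found in {logs_url!r}") from None


    def base_url_for(model_id: str) -> str:
        """Build the OpenAI-compatible base URL used in the next section."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)
    print(base_url_for(model_id))
    ```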

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-27B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.6-27B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.6-27B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
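
    With `stream=True`, reasoning and answer text arrive as incremental deltas rather than in one message. The sketch below accumulates the two separately. It assumes each streamed delta exposes the same `reasoning_content` field as the non-streaming message, which may differ across engine versions. The stand-in chunks are made up so the block runs without a deployment.

    ```python
    from types import SimpleNamespace


    def collect_stream(chunks):
        """Split streamed deltas into (reasoning, answer) strings."""
        reasoning, answer = [], []
        for chunk in chunks:
            if not chunk.choices:  # some chunks carry no choices (for example, usage-only chunks)
                continue
            delta = chunk.choices[0].delta
            # getattr guards against deltas that omit either field.
            reasoning.append(getattr(delta, "reasoning_content", None) or "")
            answer.append(getattr(delta, "content", None) or "")
        return "".join(reasoning), "".join(answer)


    def _chunk(**fields):
        """Build a fake chunk shaped like a streamed chat completion chunk."""
        return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**fields))])


    # Stand-in for: client.chat.completions.create(..., stream=True)
    fake_chunks = [
        _chunk(reasoning_content="Count the letters. "),
        _chunk(reasoning_content="There are three."),
        _chunk(content="3"),
    ]

    reasoning, answer = collect_stream(fake_chunks)
    print("reasoning:", reasoning)
    print("answer:", answer)
    ```

    Against a live deployment, pass the object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream` in place of `fake_chunks`.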

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.6-27B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
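
    The response only describes which tool the model wants to call. Running the tool and sending the result back is up to your code. The sketch below shows one way to do that, assuming the standard OpenAI tool-call shape (`id`, `function.name`, and `function.arguments` as a JSON string). The `get_weather` body and the `TOOL_REGISTRY` mapping are hypothetical stand-ins.

    ```python
    import json
    from types import SimpleNamespace


    def get_weather(location: str) -> str:
        """Hypothetical local tool; returns a canned answer for illustration."""
        return f"Sunny in {location}"


    # Maps tool names declared in `tools` to local callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def run_tool_calls(tool_calls):
        """Execute each tool call and return `tool` role messages for the next turn."""
        messages = []
        for call in tool_calls or []:  # tool_calls is None when the model answers directly
            func = TOOL_REGISTRY.get(call.function.name)
            if func is None:
                result = f"unknown tool: {call.function.name}"
            else:
                args = json.loads(call.function.arguments or "{}")
                result = func(**args)
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )
        return messages


    # Stand-in for response.choices[0].message.tool_calls
    fake_calls = [
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
        )
    ]
    print(run_tool_calls(fake_calls))
    ```

    To get a final answer, append the assistant message that contained the tool calls and then these `tool` messages to `messages`, and call `client.chat.completions.create` again.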
  </Tab>

  <Tab title="35B-A3B">
    [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context.

    This variant ships with 2 presets tuned for different goals: **Latency** for the lowest time-to-first-token, and **Throughput** for the highest tokens per second. Pick the tab that matches your workload.

    <Tabs>
      <Tab title="Latency">
        This preset serves Qwen3.6-35B-A3B on H100:4 with MTP speculative decoding, optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">H100 × 4</Card>
          <Card title="Engine" icon="server">vLLM 0.20.0</Card>
          <Card title="Context" icon="ruler-horizontal">256K</Card>
          <Card title="Concurrency" icon="layer-group">64</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.6-35b-a3b-latency && cd qwen3.6-35b-a3b-latency
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_name: "model:qwen3.6-35b-a3b preset:latency"

        model_metadata:
          example_model_input:
            model: "Qwen/Qwen3.6-35B-A3B"
            messages:
              - role: user
                content: "What is the capital of France?"
            stream: true
            max_tokens: 512
            temperature: 1.0
            top_p: 0.95
          tags:
            - openai-compatible

        base_image:
          image: vllm/vllm-openai:v0.20.0

        weights:
          - source: "hf://Qwen/Qwen3.6-35B-A3B@main"
            mount_location: "/app/checkpoint/qwen3.6-35b-a3b"
            auth_secret_name: "hf_access_token"

        resources:
          accelerator: H100:4
          use_gpu: true

        runtime:
          predict_concurrency: 64

        environment_variables:
          HF_HUB_ENABLE_HF_TRANSFER: "1"
          VLLM_LOGGING_LEVEL: WARNING

        secrets:
          hf_access_token: null

        docker_server:
          start_command: >-
            sh -c "vllm serve /app/checkpoint/qwen3.6-35b-a3b
            --served-model-name Qwen/Qwen3.6-35B-A3B
            --host 0.0.0.0
            --port 8000
            --trust-remote-code
            --tensor-parallel-size 4
            --max-model-len 262144
            --language-model-only
            --reasoning-parser qwen3
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --speculative_config.method mtp
            --speculative_config.num_speculative_tokens 2"
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                                          | Value         | What it does                                                                                                                                                |
        | --------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--trust-remote-code`                         | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
        | `--tensor-parallel-size`                      | `4`           | Number of GPUs to shard the model across.                                                                                                                   |
        | `--max-model-len`                             | `262144`      | Maximum context length (tokens) the server accepts per request.                                                                                             |
        | `--language-model-only`                       | (no value)    | Disable the multimodal path; text-only serving. Remove to enable image/video inputs.                                                                        |
        | `--reasoning-parser`                          | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
        | `--enable-auto-tool-choice`                   | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
        | `--tool-call-parser`                          | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
        | `--speculative_config.method`                 | `mtp`         | Speculative decoding method. **mtp:** Multi-token prediction head speculation.                                                                              |
        | `--speculative_config.num_speculative_tokens` | `2`           | Number of tokens the draft speculator proposes per step.                                                                                                    |

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model qwen3.6-35b-a3b-latency was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="Qwen/Qwen3.6-35B-A3B",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "Qwen/Qwen3.6-35B-A3B",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>
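
        Because this preset targets time-to-first-token, it can be useful to measure that number from the client side. The sketch below times a streamed call. It is a rough client-side measurement that includes network latency, and it assumes streamed deltas may carry text in either `content` or `reasoning_content`. The fake stream exists only so the block runs offline.

        ```python
        import time
        from types import SimpleNamespace


        def measure_stream(open_stream, clock=time.perf_counter):
            """Return (seconds to first text, total seconds, full text) for one streamed call.

            `open_stream` must start the request when called, so the timer covers
            connection and queueing time as well as generation.
            """
            start = clock()
            first_text_at = None
            pieces = []
            for chunk in open_stream():
                if not chunk.choices:
                    continue
                delta = chunk.choices[0].delta
                # Reasoning text counts as output too; the field name is an assumption.
                text = getattr(delta, "reasoning_content", None) or getattr(delta, "content", None) or ""
                if text and first_text_at is None:
                    first_text_at = clock() - start
                pieces.append(text)
            return first_text_at, clock() - start, "".join(pieces)


        def fake_stream():
            """Stand-in for a streamed response: one empty delta, then two text deltas."""
            for text in (None, "Hel", "lo"):
                yield SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


        ttft, total, text = measure_stream(fake_stream)
        print(f"first text after {ttft:.4f}s, finished after {total:.4f}s: {text!r}")
        ```

        Against a live deployment, pass `lambda: client.chat.completions.create(model=..., messages=..., stream=True)` as `open_stream`.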

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-35B-A3B",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```

        To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

        ```python theme={"system"}
        tools = [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-35B-A3B",
            messages=[
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            tools=tools,
        )
        print(response.choices[0].message.tool_calls)
        ```
      </Tab>

      <Tab title="Throughput">
        This preset serves the RedHatAI NVFP4 quantization of Qwen3.6-35B-A3B on a single B200, with FlashInfer MoE kernels, chunked prefill, and prefix caching enabled. It maximizes aggregate throughput at high concurrency.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">B200</Card>
          <Card title="Engine" icon="server">vLLM (nightly build)</Card>
          <Card title="Context" icon="ruler-horizontal">256K</Card>
          <Card title="Concurrency" icon="layer-group">1000</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.6-35b-a3b-throughput && cd qwen3.6-35b-a3b-throughput
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_name: "model:qwen3.6-35b-a3b preset:throughput"
        model_metadata:
          example_model_input:
            model: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
            messages:
              - role: user
                content: "What is the capital of France?"
            max_tokens: 100
            temperature: 0.7
          tags:
            - openai-compatible
            - vllm
            - qwen3.6
            - nvfp4
            - b200
        base_image:
          image: vllm/vllm-openai:nightly
        weights:
          - source: "hf://RedHatAI/Qwen3.6-35B-A3B-NVFP4@main"
            mount_location: "/app/model_cache/qwen3.6-35b-a3b-nvfp4"
            auth_secret_name: "hf_access_token"
        build_commands: []
        environment_variables:
          PYTORCH_ALLOC_CONF: "expandable_segments:True"
          VLLM_FLASHINFER_MOE_BACKEND: throughput
          VLLM_USE_FLASHINFER_MOE_FP4: "1"
          VLLM_USE_FLASHINFER_MOE_FP8: "1"
        docker_server:
          start_command: >-
            vllm serve /app/model_cache/qwen3.6-35b-a3b-nvfp4
            --served-model-name RedHatAI/Qwen3.6-35B-A3B-NVFP4
            --host 0.0.0.0
            --port 8000
            --gpu-memory-utilization 0.95
            --max-model-len 262144
            --max-num-batched-tokens 32768
            --dtype auto
            --enable-chunked-prefill
            --enable-prefix-caching
            --max-num-seqs 512
            --reasoning-parser qwen3
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --moe_backend flashinfer_cutlass
            --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
            --trust-remote-code
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        runtime:
          predict_concurrency: 1000
          health_checks:
            restart_check_delay_seconds: 1500
            restart_threshold_seconds: 1500
            stop_traffic_threshold_seconds: 120
        resources:
          accelerator: B200
          use_gpu: true
        secrets:
          hf_access_token: null
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                        | Value                                                 | What it does                                                                                                                                                                                 |
        | --------------------------- | ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--gpu-memory-utilization`  | `0.95`                                                | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                                                                |
        | `--max-model-len`           | `262144`                                              | Maximum context length (tokens) the server accepts per request.                                                                                                                              |
        | `--max-num-batched-tokens`  | `32768`                                               | Maximum total tokens processed per scheduler step.                                                                                                                                           |
        | `--dtype`                   | `auto`                                                | Weight precision loaded at runtime. **auto:** Match the model's checkpoint dtype (default).                                                                                                  |
        | `--enable-chunked-prefill`  | (no value)                                            | Process long prompts in chunks so decode requests keep running.                                                                                                                              |
        | `--enable-prefix-caching`   | (no value)                                            | Reuse KV cache across requests that share a prefix.                                                                                                                                          |
        | `--max-num-seqs`            | `512`                                                 | Maximum number of concurrent sequences in the batch.                                                                                                                                         |
        | `--reasoning-parser`        | `qwen3`                                               | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6).                                  |
        | `--enable-auto-tool-choice` | (no value)                                            | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                                                         |
        | `--tool-call-parser`        | `qwen3_coder`                                         | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                                                            |
        | `--moe_backend`             | `flashinfer_cutlass`                                  | MoE expert dispatch kernel. Engine-specific values select between routing implementations tuned for different hardware or model layouts.                                                     |
        | `--speculative-config`      | `{"method":"qwen3_5_mtp","num_speculative_tokens":3}` | Speculative decoding configuration as a JSON object. The dotted form (`--speculative-config.method`, `--speculative-config.num_speculative_tokens`, ...) sets the same fields one at a time. |
        | `--trust-remote-code`       | (no value)                                            | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                                                   |

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model qwen3.6-35b-a3b-throughput was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>
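
        This preset is tuned for many requests in flight, so a single sequential client will not exercise it. The sketch below fans requests out with a bounded number in flight. `run_bounded` and the `limit` value are illustrative, and `fake_send` stands in for a real request so the block runs without a deployment.

        ```python
        import asyncio


        async def run_bounded(send, prompts, limit=64):
            """Call `send(prompt)` for every prompt with at most `limit` calls in flight."""
            semaphore = asyncio.Semaphore(limit)

            async def one(prompt):
                async with semaphore:
                    return await send(prompt)

            # gather returns results in the same order as `prompts`.
            return await asyncio.gather(*(one(p) for p in prompts))


        # Tracks concurrency seen by the stand-in sender.
        stats = {"in_flight": 0, "peak": 0}


        async def fake_send(prompt):
            """Stand-in for a real request; sleeps briefly and echoes the prompt."""
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            await asyncio.sleep(0.01)
            stats["in_flight"] -= 1
            return prompt.upper()


        results = asyncio.run(run_bounded(fake_send, ["a", "b", "c", "d", "e"], limit=2))
        print(results, "peak in flight:", stats["peak"])
        ```

        Against a live deployment, `send` could wrap `AsyncOpenAI(...).chat.completions.create(...)` from the OpenAI SDK, using the same `base_url` and `model` as the example above.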

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```

        To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

        ```python theme={"system"}
        tools = [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
            messages=[
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            tools=tools,
        )
        print(response.choices[0].message.tool_calls)
        ```
      </Tab>
    </Tabs>
  </Tab>
</Tabs>
