# Qwen3.5

> Recipes for deploying four Qwen3.5 variants (4B, 9B, 35B, 122B) that span dense, hybrid MoE, and MoE architectures.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
  <a href="/examples/models/capabilities/agentic" className="capability-pill">Agentic</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="4B">
    [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) is a 4B-parameter dense model with up to 256K context.

    This preset serves Qwen3.5-4B with BF16 weights on a single H100, optimized for low time-to-first-token.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 1</Card>
      <Card title="Engine" icon="server">vLLM 0.18.0</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-4b-latency && cd qwen3.5-4b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-4b preset:latency"
    model_metadata:
      example_model_input:
        model: "Qwen/Qwen3.5-4B"
        messages:
          - role: user
            content: "What is the capital of France?"
        max_tokens: 100
        temperature: 0.7
    base_image:
      image: vllm/vllm-openai:v0.18.0
    weights:
      - source: "hf://Qwen/Qwen3.5-4B@main"
        mount_location: "/app/checkpoint/qwen3.5-4b"
        auth_secret_name: "hf_access_token"
    build_commands: []
    docker_server:
      start_command: >-
        sh -c "vllm serve /app/checkpoint/qwen3.5-4b
        --served-model-name Qwen/Qwen3.5-4B
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --dtype bfloat16
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      VLLM_LOGGING_LEVEL: WARNING
    runtime:
      predict_concurrency: 128
    resources:
      accelerator: H100:1
      use_gpu: true
    secrets:
      hf_access_token: null
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value         | What it does                                                                                                                                                |
    | --------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--gpu-memory-utilization`  | `0.95`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`       | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--dtype`                   | `bfloat16`    | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
    | `--reasoning-parser`        | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3.5-4b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
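
    If you script the deploy step, the model ID can be pulled out of the logs URL programmatically. The sketch below assumes the URL always has the `/models/{model_id}/logs/...` shape shown in the example output. `model_id_from_logs_url` and `base_url_for` are hypothetical helper names, not part of Truss or any Baseten SDK:

    ```python theme={"system"}
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows /models/ in a logs URL."""
        segments = [s for s in urlparse(logs_url).path.split("/") if s]
        try:
            return segments[segments.index("models") + 1]
        except (ValueError, IndexError):
            raise ValueError(f"no model ID found in {logs_url!r}") from None


    def base_url_for(model_id: str) -> str:
        """Build the OpenAI-compatible base URL used when calling the model."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)  # abcd1234
    print(base_url_for(model_id))
    ```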

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID, make sure `BASETEN_API_KEY` is set, then call the deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-4B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-4B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-4B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-4B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
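
    The response only names the tool the model wants to call; running it is up to your code. Below is a minimal sketch of that step, assuming each entry in `tool_calls` carries `id`, `function.name`, and a JSON-encoded `function.arguments` string, as in the OpenAI chat completions format. `get_weather` is a stand-in implementation and `run_tool_calls` is a hypothetical helper:

    ```python theme={"system"}
    import json
    from types import SimpleNamespace


    def get_weather(location: str) -> str:
        """Stand-in tool; a real one would query a weather service."""
        return f"Sunny in {location}"


    # Map tool names the model may request to local Python callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def run_tool_calls(tool_calls):
        """Execute each requested tool and return `tool` role messages."""
        messages = []
        for call in tool_calls or []:  # tool_calls is None when no tool was requested
            func = TOOL_REGISTRY[call.function.name]
            kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": func(**kwargs),
            })
        return messages


    # Simulated tool call, shaped like response.choices[0].message.tool_calls[0].
    fake_call = SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )
    print(run_tool_calls([fake_call]))
    ```

    To get a final answer, append the assistant message and these `tool` messages to `messages`, then call `client.chat.completions.create` again.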
  </Tab>

  <Tab title="9B">
    [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) is a 9B-parameter dense model with up to 256K context.

    This preset serves Qwen3.5-9B with BF16 weights on a single H100, with reasoning and tool calling enabled. It's the larger of the two dense Qwen3.5 recipes on this page.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 1</Card>
      <Card title="Engine" icon="server">vLLM 0.18.0</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-9b-latency && cd qwen3.5-9b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-9b preset:latency"
    model_metadata:
      example_model_input:
        model: "Qwen/Qwen3.5-9B"
        messages:
          - role: user
            content: "What is the capital of France?"
        max_tokens: 100
        temperature: 0.7
    base_image:
      image: vllm/vllm-openai:v0.18.0
    weights:
      - source: "hf://Qwen/Qwen3.5-9B@main"
        mount_location: "/app/checkpoint/qwen3.5-9b"
        auth_secret_name: "hf_access_token"
    build_commands: []
    docker_server:
      start_command: >-
        sh -c "vllm serve /app/checkpoint/qwen3.5-9b
        --served-model-name Qwen/Qwen3.5-9B
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --dtype bfloat16
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      VLLM_LOGGING_LEVEL: WARNING
    runtime:
      predict_concurrency: 128
    resources:
      accelerator: H100:1
      use_gpu: true
    secrets:
      hf_access_token: null
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value         | What it does                                                                                                                                                |
    | --------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--gpu-memory-utilization`  | `0.95`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`       | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--dtype`                   | `bfloat16`    | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
    | `--reasoning-parser`        | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3.5-9b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID, make sure `BASETEN_API_KEY` is set, then call the deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-9B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-9B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
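
    `reasoning_content` is a server-side extension to the OpenAI response format, so it can be missing or `None` on some responses, for example when the server emitted no separate reasoning. Below is a defensive sketch that tolerates both cases; `split_reasoning` is a hypothetical helper, and the simulated messages stand in for `response.choices[0].message`:

    ```python theme={"system"}
    from types import SimpleNamespace


    def split_reasoning(message):
        """Return (reasoning, answer), tolerating a missing reasoning field."""
        reasoning = getattr(message, "reasoning_content", None) or ""
        answer = message.content or ""
        return reasoning, answer


    # Simulated messages standing in for response.choices[0].message.
    with_thinking = SimpleNamespace(
        reasoning_content="s-t-r-a-w-b-e-r-r-y: three r's",
        content="3",
    )
    without_thinking = SimpleNamespace(content="3")

    print(split_reasoning(with_thinking))
    print(split_reasoning(without_thinking))
    ```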

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
  </Tab>

  <Tab title="35B">
    [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context.

    This variant ships in two presets tuned for different goals: **Latency** for the lowest time-to-first-token, and **Throughput** for the highest tokens per second. Pick the tab that matches your workload.

    <Tabs>
      <Tab title="Latency">
        This preset serves Qwen3.5-35B-A3B with BF16 weights on two H100s, optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">H100 × 2</Card>
          <Card title="Engine" icon="server">vLLM 0.18.0</Card>
          <Card title="Context" icon="ruler-horizontal">32K</Card>
          <Card title="Concurrency" icon="layer-group">128</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.5-35b-latency && cd qwen3.5-35b-latency
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_name: "model:qwen3.5-35b preset:latency"
        model_metadata:
          example_model_input:
            model: "Qwen/Qwen3.5-35B-A3B"
            messages:
              - role: user
                content: "What is the capital of France?"
            max_tokens: 100
            temperature: 0.7
        base_image:
          image: vllm/vllm-openai:v0.18.0
        weights:
          - source: "hf://Qwen/Qwen3.5-35B-A3B@main"
            mount_location: "/app/checkpoint/qwen3.5-35b-a3b"
            auth_secret_name: "hf_access_token"
        build_commands: []
        docker_server:
          start_command: >-
            sh -c "vllm serve /app/checkpoint/qwen3.5-35b-a3b
            --served-model-name Qwen/Qwen3.5-35B-A3B
            --host 0.0.0.0
            --port 8000
            --gpu-memory-utilization 0.95
            --max-model-len 32768
            --dtype bfloat16
            --tensor-parallel-size 2
            --reasoning-parser qwen3
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --trust-remote-code"
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        environment_variables:
          HF_HUB_ENABLE_HF_TRANSFER: '1'
          VLLM_LOGGING_LEVEL: WARNING
        runtime:
          predict_concurrency: 128
        resources:
          accelerator: H100:2
          use_gpu: true
        secrets:
          hf_access_token: null
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                        | Value         | What it does                                                                                                                                                |
        | --------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--gpu-memory-utilization`  | `0.95`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
        | `--max-model-len`           | `32768`       | Maximum context length (tokens) the server accepts per request.                                                                                             |
        | `--dtype`                   | `bfloat16`    | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
        | `--tensor-parallel-size`    | `2`           | Number of GPUs to shard the model across.                                                                                                                   |
        | `--reasoning-parser`        | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
        | `--enable-auto-tool-choice` | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
        | `--tool-call-parser`        | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
        | `--trust-remote-code`       | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
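
        In the presets on this page, `--tensor-parallel-size` matches the GPU count in `resources.accelerator`, but the two values live in different parts of the config and can drift apart when you edit one. Below is a small consistency check, assuming the accelerator string is either a bare type such as `B200` (one GPU) or `TYPE:COUNT` such as `H100:2`. `gpu_count` and `tensor_parallel_size` are hypothetical helpers:

        ```python theme={"system"}
        import re


        def gpu_count(accelerator: str) -> int:
            """Parse 'H100:2' -> 2; a bare type such as 'B200' counts as one GPU."""
            _, _, count = accelerator.partition(":")
            return int(count) if count else 1


        def tensor_parallel_size(start_command: str) -> int:
            """Read --tensor-parallel-size from a command string; vLLM defaults to 1."""
            match = re.search(r"--tensor-parallel-size[=\s]+(\d+)", start_command)
            return int(match.group(1)) if match else 1


        command = (
            'sh -c "vllm serve /app/checkpoint/qwen3.5-35b-a3b '
            '--dtype bfloat16 --tensor-parallel-size 2 --trust-remote-code"'
        )
        assert tensor_parallel_size(command) == gpu_count("H100:2")
        print("tensor parallel size matches GPU count")
        ```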

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model qwen3.5-35b-latency was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID, make sure `BASETEN_API_KEY` is set, then call the deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="Qwen/Qwen3.5-35B-A3B",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "Qwen/Qwen3.5-35B-A3B",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```

        To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

        ```python theme={"system"}
        tools = [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B",
            messages=[
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            tools=tools,
        )
        print(response.choices[0].message.tool_calls)
        ```
      </Tab>

      <Tab title="Throughput">
        This preset serves the FP8 checkpoint of Qwen3.5-35B-A3B on a single B200, with prefix caching and chunked prefill enabled. It maximizes aggregate throughput at high concurrency, at the cost of a minor quality impact from FP8 quantization.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">B200</Card>
          <Card title="Engine" icon="server">vLLM 0.18.0</Card>
          <Card title="Context" icon="ruler-horizontal">256K</Card>
          <Card title="Concurrency" icon="layer-group">1000</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.5-35b-throughput && cd qwen3.5-35b-throughput
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        ########################################################
        # Remove --language-model-only from the start command to enable multimodal inputs
        ########################################################

        model_name: "model:qwen3.5-35b preset:throughput"
        model_metadata:
          example_model_input:
            model: "Qwen/Qwen3.5-35B-A3B-FP8"
            messages:
              - role: user
                content: "What is the capital of France?"
            max_tokens: 100
            temperature: 0.7
        base_image:
          image: vllm/vllm-openai:v0.18.0
        weights:
          - source: "hf://Qwen/Qwen3.5-35B-A3B-FP8@main"
            mount_location: "/app/model_cache/qwen3.5-35b-a3b-fp8"
            auth_secret_name: "hf_access_token"
        build_commands:
          - pip install --upgrade transformers
        environment_variables:
          VLLM_USE_FLASHINFER_MOE_FP8: "0"
          PYTORCH_ALLOC_CONF: "expandable_segments:True"
        docker_server:
          start_command: >-
            vllm serve /app/model_cache/qwen3.5-35b-a3b-fp8
            --served-model-name Qwen/Qwen3.5-35B-A3B-FP8
            --host 0.0.0.0
            --language-model-only
            --port 8000
            --gpu-memory-utilization 0.95
            --kv-cache-dtype fp8
            --reasoning-parser qwen3
            --enable-chunked-prefill
            --enable-prefix-caching
            --max-num-seqs 512
            --trust-remote-code
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        runtime:
          predict_concurrency: 1000
          health_checks:
            restart_check_delay_seconds: 1500
            restart_threshold_seconds: 1500
            stop_traffic_threshold_seconds: 120
        resources:
          accelerator: B200
          use_gpu: true
        secrets:
          hf_access_token: null
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                       | Value      | What it does                                                                                                                                                |
        | -------------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--language-model-only`    | (no value) | Disable the multimodal path; text-only serving. Remove to enable image/video inputs.                                                                        |
        | `--gpu-memory-utilization` | `0.95`     | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
        | `--kv-cache-dtype`         | `fp8`      | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.                                                   |
        | `--reasoning-parser`       | `qwen3`    | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
        | `--enable-chunked-prefill` | (no value) | Process long prompts in chunks so decode requests keep running.                                                                                             |
        | `--enable-prefix-caching`  | (no value) | Reuse KV cache across requests that share a prefix.                                                                                                         |
        | `--max-num-seqs`           | `512`      | Maximum number of concurrent sequences in the batch.                                                                                                        |
        | `--trust-remote-code`      | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
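
        This config sets `predict_concurrency: 1000` while `--max-num-seqs` is `512`, so more requests can reach the server than the engine batches at once. Below is a back-of-the-envelope sketch of that split, assuming requests beyond the `--max-num-seqs` limit wait in the engine's queue until a slot frees up. `split_load` is a hypothetical helper:

        ```python theme={"system"}
        def split_load(in_flight: int, max_num_seqs: int) -> tuple[int, int]:
            """Return (running, waiting) for a given number of in-flight requests."""
            running = min(in_flight, max_num_seqs)
            return running, in_flight - running


        # Values from the config above: predict_concurrency=1000, --max-num-seqs 512.
        running, waiting = split_load(in_flight=1000, max_num_seqs=512)
        print(f"running={running} waiting={waiting}")  # running=512 waiting=488
        ```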

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model qwen3.5-35b-throughput was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID, make sure `BASETEN_API_KEY` is set, then call the deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="Qwen/Qwen3.5-35B-A3B-FP8",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "Qwen/Qwen3.5-35B-A3B-FP8",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B-FP8",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```
      </Tab>
    </Tabs>
  </Tab>

  <Tab title="122B">
    [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) is a 122B-parameter MoE model (10B active per token) with up to 256K context.

    This preset serves Qwen3.5-122B-A10B with BF16 weights sharded across four H100 GPUs. It keeps time-to-first-token low while fitting the full model on a single H100 node.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 4</Card>
      <Card title="Engine" icon="server">vLLM 0.18.0</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-122b-latency && cd qwen3.5-122b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-122b preset:latency"
    model_metadata:
      example_model_input:
        model: "Qwen/Qwen3.5-122B-A10B"
        messages:
          - role: user
            content: "What is the capital of France?"
        max_tokens: 100
        temperature: 0.7
    base_image:
      image: vllm/vllm-openai:v0.18.0
    weights:
      - source: "hf://Qwen/Qwen3.5-122B-A10B@main"
        mount_location: "/app/checkpoint/qwen3.5-122b-a10b"
        auth_secret_name: "hf_access_token"
    build_commands: []
    docker_server:
      start_command: >-
        sh -c "vllm serve /app/checkpoint/qwen3.5-122b-a10b
        --served-model-name Qwen/Qwen3.5-122B-A10B
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --dtype bfloat16
        --tensor-parallel-size 4
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      VLLM_LOGGING_LEVEL: WARNING
    runtime:
      predict_concurrency: 128
    resources:
      accelerator: H100:4
      use_gpu: true
    secrets:
      hf_access_token: null
    ```
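
    As a rough sanity check on why this config shards across four GPUs, the arithmetic below compares the BF16 weight footprint with the memory budget implied by `--gpu-memory-utilization`. It assumes 80 GB of memory per H100 and 2 bytes per parameter, and it ignores runtime overhead, so treat the numbers as a ballpark rather than a measurement.

    ```python theme={"system"}
    # Back-of-the-envelope memory budget for this preset.
    # Assumptions (not part of the config): 80 GB per H100, 2 bytes per BF16 parameter.
    PARAMS_BILLIONS = 122          # total parameters, from the model name
    BYTES_PER_PARAM = 2            # bfloat16
    GPU_MEMORY_GB = 80             # assumed H100 memory
    NUM_GPUS = 4                   # --tensor-parallel-size
    GPU_MEMORY_UTILIZATION = 0.95  # --gpu-memory-utilization

    # Billions of parameters times bytes per parameter gives gigabytes of weights.
    weights_gb = PARAMS_BILLIONS * BYTES_PER_PARAM
    # Total memory vLLM is allowed to use across all four GPUs.
    budget_gb = GPU_MEMORY_GB * NUM_GPUS * GPU_MEMORY_UTILIZATION
    # Whatever remains is shared by the KV cache and activations.
    headroom_gb = budget_gb - weights_gb

    print(f"weights ~{weights_gb} GB, budget ~{budget_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB")
    ```

    Under these assumptions the weights alone exceed the memory of two or three GPUs, which is consistent with the preset's choice of `--tensor-parallel-size 4`.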

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value         | What it does                                                                                                                                                |
    | --------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--gpu-memory-utilization`  | `0.95`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`       | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--dtype`                   | `bfloat16`    | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
    | `--tensor-parallel-size`    | `4`           | Number of GPUs to shard the model across.                                                                                                                   |
    | `--reasoning-parser`        | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model qwen3.5-122b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
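
    If you script your deployments, you can pull the model ID out of the logs URL instead of copying it by hand. The sketch below is one way to do that; `model_id_from_logs_url` and `base_url_for` are hypothetical helper names, and the code assumes the URL has the same shape as the example output above.

    ```python theme={"system"}
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows /models/ in a logs URL."""
        segments = urlparse(logs_url).path.strip("/").split("/")
        # Assumed path shape: models/<model_id>/logs/<id>
        return segments[segments.index("models") + 1]


    def base_url_for(model_id: str) -> str:
        """Build the OpenAI-compatible base URL used in the examples on this page."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)
    print(base_url_for(model_id))
    ```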

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure the `BASETEN_API_KEY` environment variable is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-122B-A10B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-122B-A10B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. Because the server runs with `--reasoning-parser qwen3`, it parses the reasoning output into a separate `reasoning_content` field on the response message:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
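
    `reasoning_content` is an extra field that the server adds, not part of the OpenAI SDK's declared message schema, so code that reads it directly may fail when the field is missing (for example, on a request made without thinking mode). The sketch below shows one defensive way to read it. `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `response.choices[0].message` so the example runs offline.

    ```python theme={"system"}
    from types import SimpleNamespace


    def split_reasoning(message):
        """Return (reasoning, answer); reasoning is None when the field is absent."""
        # getattr with a default avoids AttributeError if the field was omitted.
        reasoning = getattr(message, "reasoning_content", None)
        return reasoning, message.content


    # Stand-ins for response.choices[0].message (assumed shapes, not real responses).
    with_thinking = SimpleNamespace(reasoning_content="Count the letters...", content="3")
    without_thinking = SimpleNamespace(content="3")

    print(split_reasoning(with_thinking))
    print(split_reasoning(without_thinking))
    ```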

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response message:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
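            "name": "get_weather",
            # A description helps the model decide when to call the tool.
            "description": "Get the current weather for a location.",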
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
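
    The response only names the function the model wants to call and its arguments; running the function and returning the result is up to your code. The sketch below shows one way to dispatch tool calls to local functions and build the follow-up `tool` messages. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, the `get_weather` body is a placeholder, and the `SimpleNamespace` objects stand in for the SDK's tool call objects so the example runs offline.

    ```python theme={"system"}
    import json
    from types import SimpleNamespace


    def get_weather(location: str) -> dict:
        # Placeholder implementation; a real tool would query a weather service.
        return {"location": location, "forecast": "unknown"}


    # Map tool names from the `tools` array to local callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def run_tool_calls(tool_calls):
        """Execute each requested tool and return `tool` role messages with the results."""
        messages = []
        for call in tool_calls or []:  # tool_calls is None when no tool was requested
            handler = TOOL_REGISTRY[call.function.name]
            arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            result = handler(**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
        return messages


    # Stand-in for response.choices[0].message.tool_calls (assumed shape).
    fake_calls = [SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )]
    tool_messages = run_tool_calls(fake_calls)
    print(tool_messages)
    ```

    To continue the conversation, append the assistant message that contained the tool calls to `messages`, followed by these `tool` messages, and call `client.chat.completions.create` again with the same `tools` array.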
  </Tab>
</Tabs>
