# Gemma 4

> Gemma 4 recipes for four variants (E2B, E4B, 26B A4B, 31B), covering dense and MoE architectures.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/multimodal-image" className="capability-pill">Multimodal (image)</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="E2B">
    [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) is a 2B-parameter dense model with up to 125K context.

    This preset serves Gemma 4 E2B on a single L4, the lowest-cost deployment in the Model Library.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">L4</Card>
      <Card title="Engine" icon="server">vLLM 0.20.0</Card>
      <Card title="Context" icon="ruler-horizontal">125K</Card>
      <Card title="Concurrency" icon="layer-group">8</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir gemma-4-E2B-it-latency && cd gemma-4-E2B-it-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: model:gemma-4-E2B-it preset:latency
    base_image:
      image: vllm/vllm-openai:v0.20.0
    model_metadata:
      repo_id: google/gemma-4-E2B-it
      example_model_input:
        model: google/gemma-4-E2B-it
        messages:
          - role: user
            content:
              - type: text
                text: "Describe this image in one sentence."
              - type: image_url
                image_url:
                  url: "https://picsum.photos/id/237/200/300"
        stream: true
        max_tokens: 512
        temperature: 1.0
      tags:
        - openai-compatible
    weights:
      - source: "hf://google/gemma-4-E2B-it@main"
        mount_location: "/app/checkpoint/gemma"
        auth_secret_name: "hf_access_token"
    build_commands:
      - pip install --upgrade transformers==5.5.4
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
        --tensor-parallel-size $GPU_COUNT
        --served-model-name google/gemma-4-E2B-it
        --max-num-seqs 16
        --max-model-len auto
        --limit-mm-per-prompt.image 1
        --gpu-memory-utilization 0.9
        --async-scheduling
        --trust-remote-code
        --enable-auto-tool-choice
        --enable-prefix-caching
        --reasoning-parser gemma4
        --tool-call-parser gemma4"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: INFO
    requirements:
      - huggingface_hub
      - hf_transfer
      - datasets
    resources:
      accelerator: L4
      use_gpu: true
    secrets:
      hf_access_token: null
    runtime:
      health_checks:
        restart_check_delay_seconds: 300
        restart_threshold_seconds: 300
        stop_traffic_threshold_seconds: 120
      predict_concurrency: 8
    # Pinned vLLM release image with async scheduling enabled
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                          | Value        | What it does                                                                                               |
    | ----------------------------- | ------------ | ---------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`      | `$GPU_COUNT` | Number of GPUs to shard the model across.                                                                  |
    | `--max-num-seqs`              | `16`         | Maximum number of concurrent sequences in the batch.                                                       |
    | `--max-model-len`             | `auto`       | Maximum context length (tokens) the server accepts per request.                                            |
    | `--limit-mm-per-prompt.image` | `1`          | Maximum number of image inputs per prompt.                                                                 |
    | `--gpu-memory-utilization`    | `0.9`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
    | `--async-scheduling`          | (no value)   | Overlap scheduling with GPU execution to hide scheduler latency.                                           |
    | `--trust-remote-code`         | (no value)   | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
    | `--enable-auto-tool-choice`   | (no value)   | Let the model choose when to call tools without requiring `tool_choice: "required"`.                       |
    | `--enable-prefix-caching`     | (no value)   | Reuse KV cache across requests that share a prefix.                                                        |
    | `--reasoning-parser`          | `gemma4`     | Server-side parser that separates reasoning output into `reasoning_content`.                               |
    | `--tool-call-parser`          | `gemma4`     | Server-side parser that emits structured `tool_calls` on the response.                                     |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model gemma-4-E2B-it-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
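
    The lookup described above can be sketched in a few lines of Python, assuming the logs URL always has the form shown in the example output. `model_id_from_logs_url` and `base_url_for` are hypothetical helper names, not part of Truss or the OpenAI SDK.

    ```python theme={"system"}
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows `models` in a Baseten logs URL."""
        segments = [s for s in urlparse(logs_url).path.split("/") if s]
        try:
            return segments[segments.index("models") + 1]
        except (ValueError, IndexError):
            raise ValueError(f"no model ID found in {logs_url!r}") from None


    def base_url_for(model_id: str) -> str:
        """Fill the {model_id} placeholder in the endpoint URL used in the next section."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    # Example URL copied from the sample output above.
    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)
    print(base_url_for(model_id))
    ```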

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="google/gemma-4-E2B-it",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "google/gemma-4-E2B-it",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    The server parses the model's chain of thought into a separate `reasoning_content` field on the response. Read it alongside the final answer:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="google/gemma-4-E2B-it",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
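
    If you would rather not rely on the field always being present, a defensive read is one option. The sketch below assumes `reasoning_content` may be missing or empty on some responses; `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` object only stands in for `response.choices[0].message` so the example runs without a deployment.

    ```python theme={"system"}
    from types import SimpleNamespace


    def split_reasoning(message):
        """Return (reasoning, answer); reasoning is None when the field is absent."""
        reasoning = getattr(message, "reasoning_content", None)
        return reasoning, message.content


    # Stand-in for response.choices[0].message (assumption: same attribute names).
    fake_message = SimpleNamespace(
        reasoning_content="Count the letters one by one.",
        content="Three.",
    )

    reasoning, answer = split_reasoning(fake_message)
    if reasoning:
        print("Reasoning:", reasoning)
    print("Answer:", answer)
    ```

    With a live deployment, pass `response.choices[0].message` to the helper instead of the stand-in.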

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="google/gemma-4-E2B-it",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
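
    Once `tool_calls` comes back, your code still has to run the tool and return the result. The following is a minimal sketch of that step, not a Baseten or vLLM API: `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names, and the `SimpleNamespace` objects stand in for the SDK's tool-call objects so the example runs offline. To continue the conversation, append the assistant message and the returned `tool` messages to `messages` and call `client.chat.completions.create` again.

    ```python theme={"system"}
    import json
    from types import SimpleNamespace


    def get_weather(location: str) -> str:
        """Hypothetical local implementation of the `get_weather` tool."""
        return f"Sunny in {location}"


    # Maps tool names from the `tools` array to local Python callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def run_tool_calls(tool_calls):
        """Execute each tool call and return `tool` role messages with the results."""
        results = []
        for call in tool_calls or []:
            handler = TOOL_REGISTRY[call.function.name]
            arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            results.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": handler(**arguments),
            })
        return results


    # Stand-in for response.choices[0].message.tool_calls (assumption: same attribute names).
    fake_calls = [SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )]
    print(run_tool_calls(fake_calls))
    ```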
  </Tab>

  <Tab title="E4B">
    [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) is a 4B-parameter dense model with up to 125K context.

    This preset serves Gemma 4 E4B on a single H100.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100</Card>
      <Card title="Engine" icon="server">vLLM 0.20.0</Card>
      <Card title="Context" icon="ruler-horizontal">125K</Card>
      <Card title="Concurrency" icon="layer-group">8</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir gemma-4-E4B-it-latency && cd gemma-4-E4B-it-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: model:gemma-4-E4B-it preset:latency
    base_image:
      image: vllm/vllm-openai:v0.20.0
    model_metadata:
      repo_id: google/gemma-4-E4B-it
      example_model_input:
        model: google/gemma-4-E4B-it
        messages:
          - role: user
            content:
              - type: text
                text: "Describe this image in one sentence."
              - type: image_url
                image_url:
                  url: "https://picsum.photos/id/237/200/300"
        stream: true
        max_tokens: 512
        temperature: 1.0
      tags:
        - openai-compatible
    weights:
      - source: "hf://google/gemma-4-E4B-it@main"
        mount_location: "/app/checkpoint/gemma"
        auth_secret_name: "hf_access_token"
    build_commands:
      - pip install --upgrade transformers==5.5.4
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
        --tensor-parallel-size $GPU_COUNT
        --served-model-name google/gemma-4-E4B-it
        --max-num-seqs 16
        --max-model-len auto
        --limit-mm-per-prompt.image 1
        --gpu-memory-utilization 0.9
        --async-scheduling
        --trust-remote-code
        --enable-auto-tool-choice
        --enable-prefix-caching
        --reasoning-parser gemma4
        --tool-call-parser gemma4"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: INFO
    requirements:
      - huggingface_hub
      - hf_transfer
      - datasets
    resources:
      accelerator: H100
      use_gpu: true
    secrets:
      hf_access_token: null
    runtime:
      health_checks:
        restart_check_delay_seconds: 300
        restart_threshold_seconds: 300
        stop_traffic_threshold_seconds: 120
      predict_concurrency: 8
    # Pinned vLLM release image with async scheduling enabled
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                          | Value        | What it does                                                                                               |
    | ----------------------------- | ------------ | ---------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`      | `$GPU_COUNT` | Number of GPUs to shard the model across.                                                                  |
    | `--max-num-seqs`              | `16`         | Maximum number of concurrent sequences in the batch.                                                       |
    | `--max-model-len`             | `auto`       | Maximum context length (tokens) the server accepts per request.                                            |
    | `--limit-mm-per-prompt.image` | `1`          | Maximum number of image inputs per prompt.                                                                 |
    | `--gpu-memory-utilization`    | `0.9`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
    | `--async-scheduling`          | (no value)   | Overlap scheduling with GPU execution to hide scheduler latency.                                           |
    | `--trust-remote-code`         | (no value)   | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
    | `--enable-auto-tool-choice`   | (no value)   | Let the model choose when to call tools without requiring `tool_choice: "required"`.                       |
    | `--enable-prefix-caching`     | (no value)   | Reuse KV cache across requests that share a prefix.                                                        |
    | `--reasoning-parser`          | `gemma4`     | Server-side parser that separates reasoning output into `reasoning_content`.                               |
    | `--tool-call-parser`          | `gemma4`     | Server-side parser that emits structured `tool_calls` on the response.                                     |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model gemma-4-E4B-it-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="google/gemma-4-E4B-it",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "google/gemma-4-E4B-it",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    The server parses the model's chain of thought into a separate `reasoning_content` field on the response. Read it alongside the final answer:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="google/gemma-4-E4B-it",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="google/gemma-4-E4B-it",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
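
    The `example_model_input` in the config sends an image alongside text. A rough sketch of building the same message shape in Python follows; `image_message` is a hypothetical helper, and the prompt and image URL are copied from the config. Because the server runs with `--limit-mm-per-prompt.image 1`, the helper attaches exactly one image. Pass the result as `messages=messages` to `client.chat.completions.create`.

    ```python theme={"system"}
    def image_message(prompt: str, image_url: str) -> dict:
        """Build one user message with a text part and a single image part."""
        return {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }


    # Same prompt and image as `example_model_input` in config.yaml.
    messages = [image_message(
        "Describe this image in one sentence.",
        "https://picsum.photos/id/237/200/300",
    )]
    print(messages)
    ```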
  </Tab>

  <Tab title="26B A4B">
    [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) is a 26B-parameter MoE model (4B active per token) with up to 256K context.

    This preset serves Gemma 4 26B A4B on two H100 GPUs (`H100:2`) with FP8 dynamic quantization.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 2</Card>
      <Card title="Engine" icon="server">vLLM 0.20.0</Card>
      <Card title="Context" icon="ruler-horizontal">256K</Card>
      <Card title="Concurrency" icon="layer-group">8</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir gemma-4-26B-A4B-it-latency && cd gemma-4-26B-A4B-it-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: model:gemma-4-26B-A4B-it preset:latency
    base_image:
      image: vllm/vllm-openai:v0.20.0
    model_metadata:
      repo_id: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
      example_model_input:
        model: google/gemma-4-26B-A4B-it
        messages:
          - role: user
            content:
              - type: text
                text: "Describe this image in one sentence."
              - type: image_url
                image_url:
                  url: "https://picsum.photos/id/237/200/300"
        stream: true
        max_tokens: 512
        temperature: 1.0
      tags:
        - openai-compatible
    weights:
      - source: "hf://RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic@main"
        mount_location: "/app/checkpoint/gemma"
        auth_secret_name: "hf_access_token"
    build_commands:
      - pip install --upgrade transformers==5.5.4
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
        --tensor-parallel-size $GPU_COUNT
        --served-model-name google/gemma-4-26B-A4B-it
        --max-num-seqs 16
        --max-model-len auto
        --limit-mm-per-prompt.image 1
        --gpu-memory-utilization 0.9
        --enable-prefix-caching
        --speculative-config.model RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
        --speculative-config.num_speculative_tokens 3
        --speculative-config.method eagle3
        --trust-remote-code
        --enable-auto-tool-choice
        --reasoning-parser gemma4
        --tool-call-parser gemma4"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: INFO
    requirements:
      - huggingface_hub
      - hf_transfer
      - datasets
    resources:
      accelerator: H100:2
      use_gpu: true
    secrets:
      hf_access_token: null
    runtime:
      health_checks:
        restart_check_delay_seconds: 300
        restart_threshold_seconds: 300
        stop_traffic_threshold_seconds: 120
      predict_concurrency: 8
    # Pinned vLLM release image with EAGLE-3 speculative decoding for latency
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                                          | Value                                           | What it does                                                                                               |
    | --------------------------------------------- | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`                      | `$GPU_COUNT`                                    | Number of GPUs to shard the model across.                                                                  |
    | `--max-num-seqs`                              | `16`                                            | Maximum number of concurrent sequences in the batch.                                                       |
    | `--max-model-len`                             | `auto`                                          | Maximum context length (tokens) the server accepts per request.                                            |
    | `--limit-mm-per-prompt.image`                 | `1`                                             | Maximum number of image inputs per prompt.                                                                 |
    | `--gpu-memory-utilization`                    | `0.9`                                           | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
    | `--enable-prefix-caching`                     | (no value)                                      | Reuse KV cache across requests that share a prefix.                                                        |
    | `--speculative-config.model`                  | `RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3` | Hugging Face repo for the draft speculator checkpoint.                                                     |
    | `--speculative-config.num_speculative_tokens` | `3`                                             | Number of tokens the draft speculator proposes per step.                                                   |
    | `--speculative-config.method`                 | `eagle3`                                        | Speculative decoding method; `eagle3` selects EAGLE-3.                                                     |
    | `--trust-remote-code`                         | (no value)                                      | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
    | `--enable-auto-tool-choice`                   | (no value)                                      | Let the model choose when to call tools without requiring `tool_choice: "required"`.                       |
    | `--reasoning-parser`                          | `gemma4`                                        | Server-side parser that separates reasoning output into `reasoning_content`.                               |
    | `--tool-call-parser`                          | `gemma4`                                        | Server-side parser that emits structured `tool_calls` on the response.                                     |
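
    For intuition about `num_speculative_tokens`, the toy loop below shows greedy draft-and-verify decoding on integer tokens. It is an illustration only, not how vLLM or EAGLE-3 is implemented: `draft_next` and `target_next` are hypothetical stand-ins for the draft speculator and the full model, and a real engine verifies all drafted tokens in one batched forward pass rather than one call per token.

    ```python theme={"system"}
    def target_next(tokens):
        """Hypothetical full model: always continues with last token + 1."""
        return tokens[-1] + 1


    def draft_next(tokens):
        """Hypothetical draft model: agrees with the target except after multiples of 4."""
        return tokens[-1] + (2 if tokens[-1] % 4 == 0 else 1)


    def speculative_step(tokens, num_speculative_tokens=3):
        """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
        draft = list(tokens)
        for _ in range(num_speculative_tokens):
            draft.append(draft_next(draft))

        accepted = list(tokens)
        for proposed in draft[len(tokens):]:
            if proposed != target_next(accepted):
                break  # first disagreement: discard the rest of the draft
            accepted.append(proposed)

        # The target always contributes one token: a correction or a bonus token.
        accepted.append(target_next(accepted))
        return accepted


    print(speculative_step([1]))  # all 3 drafts accepted, plus one bonus token
    print(speculative_step([4]))  # first draft rejected, only the correction is added
    ```

    When the draft agrees with the target, one step yields up to `num_speculative_tokens + 1` tokens; when it disagrees immediately, the step still yields one token, so the output matches what the target alone would produce.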

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model gemma-4-26B-A4B-it-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="google/gemma-4-26B-A4B-it",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "google/gemma-4-26B-A4B-it",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    The server parses the model's chain of thought into a separate `reasoning_content` field on the response. Read it alongside the final answer:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
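
    The `example_model_input` in the config sets `stream: true`. As a sketch of consuming a streamed response, the helper below concatenates the `content` deltas; `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the chunk shape so the example runs offline. With a live deployment, pass the iterator returned by `client.chat.completions.create(..., stream=True)` instead.

    ```python theme={"system"}
    from types import SimpleNamespace


    def collect_stream(chunks):
        """Concatenate the `content` deltas from a chat-completions stream."""
        parts = []
        for chunk in chunks:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            delta = chunk.choices[0].delta
            if delta.content:
                parts.append(delta.content)
        return "".join(parts)


    def fake_chunk(text):
        """Stand-in for one streamed chunk (assumption: same attribute names as the SDK)."""
        return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


    print(collect_stream([fake_chunk("Machine "), fake_chunk(None), fake_chunk("learning")]))
    ```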
  </Tab>

  <Tab title="31B">
    [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) is a 31B-parameter dense model with up to 256K context.

    This preset serves Gemma 4 31B on two H100 GPUs (`H100:2`) with FP8 block quantization.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 2</Card>
      <Card title="Engine" icon="server">vLLM 0.20.0</Card>
      <Card title="Context" icon="ruler-horizontal">256K</Card>
      <Card title="Concurrency" icon="layer-group">8</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir gemma-4-31B-it-latency && cd gemma-4-31B-it-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: model:gemma-4-31B-it preset:latency
    base_image:
      image: vllm/vllm-openai:v0.20.0
    model_metadata:
      repo_id: RedHatAI/gemma-4-31B-it-FP8-block
      example_model_input:
        model: google/gemma-4-31B-it
        messages:
          - role: user
            content:
              - type: text
                text: "Describe this image in one sentence."
              - type: image_url
                image_url:
                  url: "https://picsum.photos/id/237/200/300"
        stream: true
        max_tokens: 512
        temperature: 1.0
      tags:
        - openai-compatible
    weights:
      - source: "hf://RedHatAI/gemma-4-31B-it-FP8-block@main"
        mount_location: "/app/checkpoint/gemma"
        auth_secret_name: "hf_access_token"
    build_commands:
      - pip install --upgrade transformers==5.5.4
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/gemma
        --tensor-parallel-size $GPU_COUNT
        --served-model-name google/gemma-4-31B-it
        --max-num-seqs 16
        --max-model-len auto
        --limit-mm-per-prompt.image 1
        --gpu-memory-utilization 0.9
        --enable-prefix-caching
        --speculative-config.model RedHatAI/gemma-4-31B-it-speculator.eagle3
        --speculative-config.num_speculative_tokens 3
        --speculative-config.method eagle3
        --trust-remote-code
        --enable-auto-tool-choice
        --reasoning-parser gemma4
        --tool-call-parser gemma4"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: INFO
    requirements:
      - huggingface_hub
      - hf_transfer
      - datasets
    resources:
      accelerator: H100:2
      use_gpu: true
    secrets:
      hf_access_token: null
    runtime:
      health_checks:
        restart_check_delay_seconds: 300
        restart_threshold_seconds: 300
        stop_traffic_threshold_seconds: 120
      predict_concurrency: 8
    # Pinned vLLM release image with EAGLE-3 speculative decoding for latency
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                                          | Value                                       | What it does                                                                                               |
    | --------------------------------------------- | ------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`                      | `$GPU_COUNT`                                | Number of GPUs to shard the model across.                                                                  |
    | `--max-num-seqs`                              | `16`                                        | Maximum number of concurrent sequences in the batch.                                                       |
    | `--max-model-len`                             | `auto`                                      | Maximum context length (tokens) the server accepts per request.                                            |
    | `--limit-mm-per-prompt.image`                 | `1`                                         | Maximum number of image inputs per prompt.                                                                 |
    | `--gpu-memory-utilization`                    | `0.9`                                       | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
    | `--enable-prefix-caching`                     | (no value)                                  | Reuse KV cache across requests that share a prefix.                                                        |
    | `--speculative-config.model`                  | `RedHatAI/gemma-4-31B-it-speculator.eagle3` | Hugging Face repo for the draft speculator checkpoint.                                                     |
    | `--speculative-config.num_speculative_tokens` | `3`                                         | Number of tokens the draft speculator proposes per step.                                                   |
    | `--speculative-config.method`                 | `eagle3`                                    | Speculative decoding method. **eagle3:** EAGLE v3 speculative decoding.                                    |
    | `--trust-remote-code`                         | (no value)                                  | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
    | `--enable-auto-tool-choice`                   | (no value)                                  | Let the model choose when to call tools without requiring `tool_choice: "required"`.                       |
    | `--reasoning-parser`                          | `gemma4`                                    | Server-side parser that separates reasoning output into `reasoning_content`.                               |
    | `--tool-call-parser`                          | `gemma4`                                    | Server-side parser that emits structured `tool_calls` on the response.                                     |
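
    The three `--speculative-config.*` rows in the table describe one speculative decoding configuration, written with dotted keys. As a rough illustration of how dotted flags group into nested settings, the sketch below folds them into a dictionary. `fold_dotted_flags` is a hypothetical helper written for this page, not part of vLLM or Truss, and the exact structure vLLM builds internally is an assumption.

    ```python
    def fold_dotted_flags(flags: dict) -> dict:
        """Group dotted flag names (e.g. --a.b) into nested dictionaries."""
        nested = {}
        for flag, value in flags.items():
            # "--speculative-config.model" -> ["speculative-config", "model"]
            *parents, leaf = flag.lstrip("-").split(".")
            node = nested
            for key in parents:
                node = node.setdefault(key, {})
            node[leaf] = value
        return nested


    # Values copied from the flag table above.
    flags = {
        "--speculative-config.model": "RedHatAI/gemma-4-31B-it-speculator.eagle3",
        "--speculative-config.num_speculative_tokens": 3,
        "--speculative-config.method": "eagle3",
        "--limit-mm-per-prompt.image": 1,
    }

    folded = fold_dotted_flags(flags)
    print(folded["speculative-config"])
    print(folded["limit-mm-per-prompt"])
    ```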

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model gemma-4-31B-it-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the path segment that follows `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
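
    If you script your deployments, you can pull the model ID out of the logs URL instead of copying it by hand. The sketch below assumes the URL has the same shape as the example output above. `model_id_from_logs_url` and `base_url_for` are hypothetical helper names introduced here for illustration.

    ```python
    from urllib.parse import urlparse


    def model_id_from_logs_url(logs_url: str) -> str:
        """Return the path segment that follows /models/ in a logs URL."""
        parts = [p for p in urlparse(logs_url).path.split("/") if p]
        idx = parts.index("models")  # raises ValueError if there is no /models/ segment
        return parts[idx + 1]


    def base_url_for(model_id: str) -> str:
        """Build the OpenAI-compatible base URL used in the next section."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
    model_id = model_id_from_logs_url(logs_url)
    print(model_id)
    print(base_url_for(model_id))
    ```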

    ## Call the model

    Your deployment serves an OpenAI-compatible API. In the examples below, replace `{model_id}` with your model ID and make sure the `BASETEN_API_KEY` environment variable is set:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="google/gemma-4-31B-it",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "google/gemma-4-31B-it",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

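    With `--reasoning-parser gemma4` set, the server separates the model's chain of thought from the final answer and returns it in a `reasoning_content` field on the message. Read both fields from the same response, reusing the `client` from the Python example above: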

    ```python theme={"system"}
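    # Reuses the `client` created in main.py above.
    response = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
    )
    message = response.choices[0].message
    # `reasoning_content` is added by the server and is not part of the
    # OpenAI SDK's typed schema, so read it defensively.
    print(getattr(message, "reasoning_content", None))  # chain of thought
    print(message.content)                              # final answer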
    ```
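
    This preset also accepts image input, capped at one image per prompt by `--limit-mm-per-prompt.image`. A minimal sketch of building such a request follows, assuming the server accepts the OpenAI-style `image_url` content part with a base64 data URL. `image_message` is a hypothetical helper, and the placeholder bytes stand in for a real image file.

    ```python
    import base64

    MAX_IMAGES_PER_PROMPT = 1  # mirrors --limit-mm-per-prompt.image in this preset


    def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
        """Build a user message that carries a text prompt and one inline image."""
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }


    def count_images(message: dict) -> int:
        """Count image parts so a prompt can be checked against the per-prompt limit."""
        return sum(1 for part in message["content"] if part["type"] == "image_url")


    # Placeholder bytes; in practice read them from disk, e.g. open("photo.png", "rb").read()
    msg = image_message("Describe this image.", b"placeholder-image-bytes")
    assert count_images(msg) <= MAX_IMAGES_PER_PROMPT
    print(msg["content"][0])
    ```

    Pass the result as `messages=[msg]` to `client.chat.completions.create`, using the same `model` value as the other examples.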

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
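    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            # A description helps the model decide when to call the tool.
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]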

    response = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
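
    The response above only tells you which tool the model wants to call. Your code still has to run it and send the result back. The sketch below shows one way to do that round trip, following the usual OpenAI-style pattern of replying with `tool` role messages. The `get_weather` body, `TOOL_REGISTRY`, `run_tool_calls`, and `answer_with_tools` are hypothetical names introduced for illustration, and the returned weather data is a placeholder.

    ```python
    import json
    from types import SimpleNamespace


    def get_weather(location: str) -> dict:
        """Hypothetical stand-in: replace with a real weather lookup."""
        return {"location": location, "forecast": "sunny"}


    # Maps tool names declared in `tools` to local Python callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def run_tool_calls(tool_calls) -> list:
        """Run each requested tool and build the `tool` messages to send back."""
        results = []
        for call in tool_calls or []:
            func = TOOL_REGISTRY[call.function.name]
            kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            results.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(func(**kwargs)),
            })
        return results


    def answer_with_tools(client, model, messages, tools):
        """One round trip: let the model request tools, run them, then get the final answer."""
        first = client.chat.completions.create(model=model, messages=messages, tools=tools)
        message = first.choices[0].message
        if not message.tool_calls:
            return message.content
        # Echo the assistant turn back so each tool result can be matched to its call.
        assistant_turn = {
            "role": "assistant",
            "content": message.content,
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
                for call in message.tool_calls
            ],
        }
        follow_up = messages + [assistant_turn] + run_tool_calls(message.tool_calls)
        final = client.chat.completions.create(model=model, messages=follow_up, tools=tools)
        return final.choices[0].message.content


    # Offline check with a stand-in for the SDK's tool call object (no network needed).
    fake_call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )
    demo_messages = run_tool_calls([fake_call])
    print(demo_messages)
    ```

    With a live deployment, call `answer_with_tools(client, "google/gemma-4-31B-it", messages, tools)` using the `client` and `tools` defined earlier.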
  </Tab>
</Tabs>
