> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# GPT-OSS

> GPT-OSS recipes: 2 variants (20B, 120B), Dense and MoE architectures.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/agentic" className="capability-pill">Agentic</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign into Baseten with Truss and then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="20B">
    [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) is a 20B-parameter dense model with up to 128K context.

    This preset serves GPT-OSS 20B on a single H100 using the Harmony response format, tuned for low time-to-first-token.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100</Card>
      <Card title="Engine" icon="server">TRT-LLM v2</Card>
      <Card title="Context" icon="ruler-horizontal">128K</Card>
      <Card title="Concurrency" icon="layer-group">64</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir gpt-oss-20b-latency && cd gpt-oss-20b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:gpt-oss-20b preset:latency"
    build_commands:
      - python -c 'from openai_harmony import load_harmony_encoding; load_harmony_encoding("HarmonyGptOss")'
    model_metadata:
      repo_id: openai/gpt-oss-20b
      example_model_input:
        {
          "model": "openai/gpt-oss-20b",
          "messages":
            [
              {
                "role": "user",
                "content": "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:",
              },
            ],
          "stream": true,
          "max_tokens": 4096,
          "temperature": 0.5,
        }
      tags:
        - openai-compatible
    resources:
      accelerator: H100
      cpu: "1"
      memory: 10Gi
      use_gpu: true
    weights:
      - source: "hf://openai/gpt-oss-20b@main"
        mount_location: "/app/model_cache/trt_model"
    trt_llm:
      build:
        checkpoint_repository:
          repo: michaelfeil/empty-model
          revision: main
          source: HF
      inference_stack: v2
      runtime:
        enable_chunked_prefill: true
        max_batch_size: 64
        max_num_tokens: 8192
        max_seq_len: 131072
        patch_kwargs:
          model_path: /app/model_cache/trt_model
          chat_processor: harmony
          moe_expert_parallel_size: 1
          backend: pytorch
          cuda_graph_config:
            enable_padding: true
          disable_overlap_scheduler: 1
          enable_autotuner: 0
          enable_iter_perf_stats: 0
          enable_trtllm_sampler: 1
          guided_decoding_backend: xgrammar
          kv_cache_config:
            enable_block_reuse: true
            free_gpu_memory_fraction: 0.8
            event_buffer_max_size: 1024
          max_beam_width: 1
          max_input_len: 131072
          model_level_stop_words:
            - "<|call|>"
          tokenizer_limit_length: 131072
          trust_remote_code: 1
          moe_config:
            backend: CUTLASS
        served_model_name: openai/gpt-oss-20b
        tensor_parallel_size: 1
      version_overrides:
        v2_llm_version: null
    ```

    ## Key parameters

    [Baseten Inference Stack](/engines/bis-llm/overview) (BIS) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

    | Parameter            | Value                |
    | -------------------- | -------------------- |
    | Tensor parallel size | `1`                  |
    | Max sequence length  | `131072`             |
    | Max batch size       | `64`                 |
    | Max batched tokens   | `8192`               |
    | Chunked prefill      | `enabled`            |
    | Inference stack      | `v2`                 |
    | Served model name    | `openai/gpt-oss-20b` |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```text theme={"system"}
    ✨ Model gpt-oss-20b-latency was successfully pushed ✨
    🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
    ```

    Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="openai/gpt-oss-20b",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "openai/gpt-oss-20b",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>
  </Tab>

  <Tab title="120B">
    [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) is a 120B-parameter MoE model with up to 128K context.

    This variant ships in 2 presets tuned for different goals: **H100 Throughput** for high throughput on H100 hardware, and **Throughput** for highest tokens per second. Pick the tab that matches your workload.

    <Tabs>
      <Tab title="H100 Throughput">
        This preset serves GPT-OSS 120B on H100:4 for deployments that don't have Blackwell capacity.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">H100 × 4</Card>
          <Card title="Engine" icon="server">vLLM 0.18.0</Card>
          <Card title="Context" icon="ruler-horizontal">16K</Card>
          <Card title="Concurrency" icon="layer-group">256</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir gpt-oss-120b-h100-throughput && cd gpt-oss-120b-h100-throughput
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_metadata:
          example_model_input:
            messages:
              - role: system
                content: "You are a helpful assistant."
              - role: user
                content: "Write FizzBuzz in Python"
            stream: true
            model: "openai/gpt-oss-120b"
            max_tokens: 4096
            temperature: 0.5
          tags:
            - openai-compatible
        model_name: "model:gpt-oss-120b preset:h100-throughput"
        weights:
          - source: "hf://openai/gpt-oss-120b@b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
            mount_location: "/models/gpt-oss-120b"
            ignore_patterns: ["original/*", "metal/model.bin"]
        build_commands:
          - mkdir -p /opt/tiktoken
          - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -o /opt/tiktoken/o200k_base.tiktoken
          - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o /opt/tiktoken/cl100k_base.tiktoken
        base_image:
          image: vllm/vllm-openai:v0.18.0
        environment_variables:
          TIKTOKEN_ENCODINGS_BASE: "/opt/tiktoken"
          TIKTOKEN_RS_CACHE_DIR: "/opt/tiktoken"
        docker_server:
          start_command: >
            sh -c "export COMPILATION_CONFIG='{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}' &&
            vllm serve /models/gpt-oss-120b
            --host 0.0.0.0
            --port 8000
            --served-model-name openai/gpt-oss-120b
            --tensor-parallel-size 4
            --gpu-memory-utilization 0.90
            --max-model-len 16384
            --max-num-batched-tokens 16384
            --max-num-seqs 256
            --stream-interval 20
            --enable-chunked-prefill
            --enable-prefix-caching
            --compilation-config \"$COMPILATION_CONFIG\"
            --async-scheduling
            --trust-remote-code"
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        resources:
          accelerator: H100:4
          use_gpu: true
        runtime:
          predict_concurrency: 256
          health_checks:
            restart_check_delay_seconds: 1500
            restart_threshold_seconds: 30
            stop_traffic_threshold_seconds: 30
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                       | Value                 | What it does                                                                                               |
        | -------------------------- | --------------------- | ---------------------------------------------------------------------------------------------------------- |
        | `--tensor-parallel-size`   | `4`                   | Number of GPUs to shard the model across.                                                                  |
        | `--gpu-memory-utilization` | `0.90`                | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
        | `--max-model-len`          | `16384`               | Maximum context length (tokens) the server accepts per request.                                            |
        | `--max-num-batched-tokens` | `16384`               | Maximum total tokens processed per scheduler step.                                                         |
        | `--max-num-seqs`           | `256`                 | Maximum number of concurrent sequences in the batch.                                                       |
        | `--stream-interval`        | `20`                  | Tokens emitted per streaming chunk.                                                                        |
        | `--enable-chunked-prefill` | (no value)            | Process long prompts in chunks so decode requests keep running.                                            |
        | `--enable-prefix-caching`  | (no value)            | Reuse KV cache across requests that share a prefix.                                                        |
        | `--compilation-config`     | `$COMPILATION_CONFIG` | vLLM compilation passes (op fusion, dead-code elimination).                                                |
        | `--async-scheduling`       | (no value)            | Overlap scheduling with GPU execution to hide scheduler latency.                                           |
        | `--trust-remote-code`      | (no value)            | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model gpt-oss-120b-h100-throughput was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "openai/gpt-oss-120b",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>
      </Tab>

      <Tab title="Throughput">
        This preset serves GPT-OSS 120B on B200:4 with FP8 KV cache and FlashInfer MXFP4+MXFP8 MoE kernels, optimized for maximum throughput on Blackwell.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">B200 × 4</Card>
          <Card title="Engine" icon="server">vLLM 0.12.0</Card>
          <Card title="Context" icon="ruler-horizontal">8K</Card>
          <Card title="Concurrency" icon="layer-group">256</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir gpt-oss-120b-throughput && cd gpt-oss-120b-throughput
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_name: "model:gpt-oss-120b preset:throughput"
        model_metadata:
          tags:
            - openai-compatible

        base_image:
          # Pin instead of :latest for reproducibility
          image: vllm/vllm-openai:v0.12.0 # GPT-OSS recipe for Blackwell

        # Pull Harmony/tiktoken vocab during build so runtime doesn't need network for this.
        build_commands:
          - mkdir -p /opt/tiktoken
          - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -o /opt/tiktoken/o200k_base.tiktoken
          - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o /opt/tiktoken/cl100k_base.tiktoken

        resources:
          accelerator: B200:4
          use_gpu: true

        runtime:
          predict_concurrency: 256

        environment_variables:
          # Blackwell GPT-OSS perf: enable FlashInfer MXFP4+MXFP8 MoE path
          VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"

          # Harmony vocab location (avoids runtime download)
          TIKTOKEN_ENCODINGS_BASE: "/opt/tiktoken"
          TIKTOKEN_RS_CACHE_DIR: "/opt/tiktoken"

        docker_server:
          # Standard vLLM OpenAI server port
          server_port: 8000

          # Map Baseten /predict to OpenAI-compatible chat completions.
          # (If you prefer /v1/responses, change this accordingly.)
          predict_endpoint: /v1/chat/completions
          readiness_endpoint: /health
          liveness_endpoint: /health

          # IMPORTANT: one shell command (newlines are OK only with \ continuations)
          start_command: >-
            bash -lc '
              exec vllm serve openai/gpt-oss-120b
                --host 0.0.0.0
                --port 8000
                --served-model-name gpt-oss-120b
                --tensor-parallel-size 4
                --gpu-memory-utilization 0.95
                --max-model-len 8192
                --max-num-batched-tokens 8192
                --max-num-seqs 256
                --cuda-graph-capture-size 2048
                --stream-interval 20
                --kv-cache-dtype fp8
                --compilation-config "{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}"
                --async-scheduling
                --trust-remote-code
            '
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                        | Value                                                                | What it does                                                                                               |
        | --------------------------- | -------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
        | `--tensor-parallel-size`    | `4`                                                                  | Number of GPUs to shard the model across.                                                                  |
        | `--gpu-memory-utilization`  | `0.95`                                                               | Fraction of GPU memory vLLM may use for weights and KV cache.                                              |
        | `--max-model-len`           | `8192`                                                               | Maximum context length (tokens) the server accepts per request.                                            |
        | `--max-num-batched-tokens`  | `8192`                                                               | Maximum total tokens processed per scheduler step.                                                         |
        | `--max-num-seqs`            | `256`                                                                | Maximum number of concurrent sequences in the batch.                                                       |
        | `--cuda-graph-capture-size` | `2048`                                                               | Batch size ceiling for CUDA graph capture (improves decode latency).                                       |
        | `--stream-interval`         | `20`                                                                 | Tokens emitted per streaming chunk.                                                                        |
        | `--kv-cache-dtype`          | `fp8`                                                                | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.  |
        | `--compilation-config`      | `{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}` | vLLM compilation passes (op fusion, dead-code elimination).                                                |
        | `--async-scheduling`        | (no value)                                                           | Overlap scheduling with GPU execution to hide scheduler latency.                                           |
        | `--trust-remote-code`       | (no value)                                                           | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```text theme={"system"}
        ✨ Model gpt-oss-120b-throughput was successfully pushed ✨
        🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
        ```

        Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="gpt-oss-120b",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "gpt-oss-120b",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>
      </Tab>
    </Tabs>
  </Tab>
</Tabs>
