GPT-OSS

Reasoning, Tool calling, Agentic, Long context

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. Each tab is a self-contained recipe.

20B
120B

openai/gpt-oss-20b is a 20B-parameter mixture-of-experts (MoE) model with up to 128K context. This preset serves GPT-OSS 20B on a single H100 using the Harmony response format, tuned for low time-to-first-token.

Hardware

H100

Engine

TRT-LLM v2

Context

128K

Concurrency

Write the config

Create and move into the project directory:

mkdir gpt-oss-20b-latency && cd gpt-oss-20b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:gpt-oss-20b preset:latency"
build_commands:
  - python -c 'from openai_harmony import load_harmony_encoding; load_harmony_encoding("HarmonyGptOss")'
model_metadata:
  repo_id: openai/gpt-oss-20b
  example_model_input:
    {
      "model": "openai/gpt-oss-20b",
      "messages":
        [
          {
            "role": "user",
            "content": "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:",
          },
        ],
      "stream": true,
      "max_tokens": 4096,
      "temperature": 0.5,
    }
  tags:
    - openai-compatible
resources:
  accelerator: H100
  cpu: "1"
  memory: 10Gi
  use_gpu: true
weights:
  - source: "hf://openai/gpt-oss-20b@main"
    mount_location: "/app/model_cache/trt_model"
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 64
    max_num_tokens: 8192
    max_seq_len: 131072
    patch_kwargs:
      model_path: /app/model_cache/trt_model
      chat_processor: harmony
      moe_expert_parallel_size: 1
      backend: pytorch
      cuda_graph_config:
        enable_padding: true
      disable_overlap_scheduler: 1
      enable_autotuner: 0
      enable_iter_perf_stats: 0
      enable_trtllm_sampler: 1
      guided_decoding_backend: xgrammar
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
        event_buffer_max_size: 1024
      max_beam_width: 1
      max_input_len: 131072
      model_level_stop_words:
        - "<|call|>"
      tokenizer_limit_length: 131072
      trust_remote_code: 1
      moe_config:
        backend: CUTLASS
    served_model_name: openai/gpt-oss-20b
    tensor_parallel_size: 1
  version_overrides:
    v2_llm_version: null

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Tensor parallel size	`1`
Max sequence length	`131072`
Max batch size	`64`
Max batched tokens	`8192`
Chunked prefill	`enabled`
Inference stack	`v2`
Served model name	`openai/gpt-oss-20b`
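As a rough illustration of how the sequence-length and batched-token limits interact when chunked prefill is enabled, the sketch below estimates the minimum number of scheduler steps needed to prefill a prompt. It assumes each step processes at most "Max batched tokens" tokens and that the whole budget goes to a single request. The function name is hypothetical, and real scheduling will usually share the budget with other requests.

```python
import math

# Values copied from the Key parameters table above.
MAX_SEQ_LEN = 131072    # Max sequence length
MAX_NUM_TOKENS = 8192   # Max batched tokens per scheduler step


def min_prefill_steps(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on scheduler steps needed to prefill a prompt.

    Hypothetical helper: assumes the full per-step token budget is
    available to this one prompt, which is the best case.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return math.ceil(prompt_tokens / max_num_tokens)


print(MAX_SEQ_LEN // 1024)             # 128, i.e. the "128K" context
print(min_prefill_steps(MAX_SEQ_LEN))  # 16 chunks for a full-length prompt
print(min_prefill_steps(8193))         # 2: one token over the budget spills into a second step
```

A smaller batched-token budget means more steps per long prompt, but each step leaves room for decode requests to keep running alongside the prefill.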

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gpt-oss-20b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
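If you script your deployments, the model ID can be pulled out of the logs URL programmatically. The following is a minimal sketch that assumes the URL path always has the shape /models/{model_id}/logs/{deployment_id} shown in the example output. The helper names are hypothetical and not part of Truss or the Baseten SDK.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Expected path shape (an assumption): models/<model_id>/logs/<deployment_id>
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"unexpected logs URL: {logs_url!r}")
    return parts[1]


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL used when calling the model."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))  # https://model-abcd1234.api.baseten.co/environments/production/sync/v1
```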

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
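The example input in config.yaml sets "stream": true, so a streaming variant of the call may also be useful. The sketch below assumes the deployment streams standard OpenAI chat-completion chunks. join_stream and stream_from_deployment are hypothetical helper names, and the stand-in chunks at the bottom exist only so the joining logic can be checked without a live deployment.

```python
import os
from types import SimpleNamespace


def join_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:              # skip None or empty deltas
            pieces.append(delta)
    return "".join(pieces)


def stream_from_deployment(model_id: str, prompt: str) -> str:
    """Call the deployment with stream=True and return the joined text."""
    from openai import OpenAI  # imported here so join_stream works offline

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_stream(stream)


# Offline check with stand-in chunks shaped like the SDK's streaming objects.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])
    for text in ("Machine ", None, "learning")
]
print(join_stream(fake_chunks))  # Machine learning
```

To run it against a real deployment, call stream_from_deployment with your model ID and a prompt. To display tokens as they arrive, print each delta inside the loop instead of collecting them.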

openai/gpt-oss-120b is a 120B-parameter MoE model with up to 128K context. This variant ships in two presets tuned for different goals: H100 Throughput for high throughput on H100 hardware, and Throughput for the highest tokens per second. Pick the tab that matches your workload.

H100 Throughput
Throughput

This preset serves GPT-OSS 120B on H100:4 for deployments that don’t have Blackwell capacity.

Hardware

H100 × 4

Engine

vLLM 0.18.0

Context

16K

Concurrency

256

Write the config

Create and move into the project directory:

mkdir gpt-oss-120b-h100-throughput && cd gpt-oss-120b-h100-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input:
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "Write FizzBuzz in Python"
    stream: true
    model: "openai/gpt-oss-120b"
    max_tokens: 4096
    temperature: 0.5
  tags:
    - openai-compatible
model_name: "model:gpt-oss-120b preset:h100-throughput"
weights:
  - source: "hf://openai/gpt-oss-120b@b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
    mount_location: "/models/gpt-oss-120b"
    ignore_patterns: ["original/*", "metal/model.bin"]
build_commands:
  - mkdir -p /opt/tiktoken
  - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -o /opt/tiktoken/o200k_base.tiktoken
  - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o /opt/tiktoken/cl100k_base.tiktoken
base_image:
  image: vllm/vllm-openai:v0.18.0
environment_variables:
  TIKTOKEN_ENCODINGS_BASE: "/opt/tiktoken"
  TIKTOKEN_RS_CACHE_DIR: "/opt/tiktoken"
docker_server:
  start_command: >
    sh -c "export COMPILATION_CONFIG='{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}' &&
    vllm serve /models/gpt-oss-120b
    --host 0.0.0.0
    --port 8000
    --served-model-name openai/gpt-oss-120b
    --tensor-parallel-size 4
    --gpu-memory-utilization 0.90
    --max-model-len 16384
    --max-num-batched-tokens 16384
    --max-num-seqs 256
    --stream-interval 20
    --enable-chunked-prefill
    --enable-prefix-caching
    --compilation-config \"$COMPILATION_CONFIG\"
    --async-scheduling
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100:4
  use_gpu: true
runtime:
  predict_concurrency: 256
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 30
    stop_traffic_threshold_seconds: 30

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--gpu-memory-utilization`	`0.90`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--max-model-len`	`16384`	Maximum context length (tokens) the server accepts per request.
`--max-num-batched-tokens`	`16384`	Maximum total tokens processed per scheduler step.
`--max-num-seqs`	`256`	Maximum number of concurrent sequences in the batch.
`--stream-interval`	`20`	Tokens emitted per streaming chunk.
`--enable-chunked-prefill`	(no value)	Process long prompts in chunks so decode requests keep running.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--compilation-config`	`$COMPILATION_CONFIG`	vLLM compilation passes (op fusion, dead-code elimination).
`--async-scheduling`	(no value)	Overlap scheduling with GPU execution to hide scheduler latency.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
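Because this preset caps context at 16384 tokens, a client may want to clamp max_tokens before sending a request. The sketch below assumes the --max-model-len limit covers prompt and generated tokens together, and that you already know the prompt's token count. clamp_max_tokens is a hypothetical helper, not part of vLLM or the OpenAI SDK.

```python
MAX_MODEL_LEN = 16384  # value of --max-model-len in this preset


def clamp_max_tokens(
    prompt_tokens: int,
    requested_max_tokens: int,
    max_model_len: int = MAX_MODEL_LEN,
) -> int:
    """Return the largest max_tokens value that still fits the context window."""
    if prompt_tokens < 0 or requested_max_tokens < 1:
        raise ValueError("token counts must be non-negative and max_tokens at least 1")
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills the context window")
    return min(requested_max_tokens, remaining)


print(clamp_max_tokens(12000, 4096))  # 4096: the request fits as-is
print(clamp_max_tokens(14000, 4096))  # 2384: only 16384 - 14000 tokens remain
```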

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gpt-oss-120b-h100-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

This preset serves GPT-OSS 120B on B200:4 with FP8 KV cache and FlashInfer MXFP4+MXFP8 MoE kernels, optimized for maximum throughput on Blackwell.

Hardware

B200 × 4

Engine

vLLM 0.12.0

Context

8K

Concurrency

256

Write the config

Create and move into the project directory:

mkdir gpt-oss-120b-throughput && cd gpt-oss-120b-throughput

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:gpt-oss-120b preset:throughput"
model_metadata:
  tags:
    - openai-compatible

base_image:
  # Pin instead of :latest for reproducibility
  image: vllm/vllm-openai:v0.12.0 # GPT-OSS recipe for Blackwell

# Pull Harmony/tiktoken vocab during build so runtime doesn't need network for this.
build_commands:
  - mkdir -p /opt/tiktoken
  - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -o /opt/tiktoken/o200k_base.tiktoken
  - curl -fsSL https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -o /opt/tiktoken/cl100k_base.tiktoken

resources:
  accelerator: B200:4
  use_gpu: true

runtime:
  predict_concurrency: 256

environment_variables:
  # Blackwell GPT-OSS perf: enable FlashInfer MXFP4+MXFP8 MoE path
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"

  # Harmony vocab location (avoids runtime download)
  TIKTOKEN_ENCODINGS_BASE: "/opt/tiktoken"
  TIKTOKEN_RS_CACHE_DIR: "/opt/tiktoken"

docker_server:
  # Standard vLLM OpenAI server port
  server_port: 8000

  # Map Baseten /predict to OpenAI-compatible chat completions.
  # (If you prefer /v1/responses, change this accordingly.)
  predict_endpoint: /v1/chat/completions
  readiness_endpoint: /health
  liveness_endpoint: /health

  # IMPORTANT: one shell command (newlines are OK only with \ continuations)
  start_command: >-
    bash -lc '
      exec vllm serve openai/gpt-oss-120b
        --host 0.0.0.0
        --port 8000
        --served-model-name gpt-oss-120b
        --tensor-parallel-size 4
        --gpu-memory-utilization 0.95
        --max-model-len 8192
        --max-num-batched-tokens 8192
        --max-num-seqs 256
        --cuda-graph-capture-size 2048
        --stream-interval 20
        --kv-cache-dtype fp8
        --compilation-config "{\"pass_config\":{\"fuse_allreduce_rms\":true,\"eliminate_noops\":true}}"
        --async-scheduling
        --trust-remote-code
    '

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--gpu-memory-utilization`	`0.95`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--max-model-len`	`8192`	Maximum context length (tokens) the server accepts per request.
`--max-num-batched-tokens`	`8192`	Maximum total tokens processed per scheduler step.
`--max-num-seqs`	`256`	Maximum number of concurrent sequences in the batch.
`--cuda-graph-capture-size`	`2048`	Batch size ceiling for CUDA graph capture (improves decode latency).
`--stream-interval`	`20`	Tokens emitted per streaming chunk.
`--kv-cache-dtype`	`fp8`	KV cache numeric precision. fp8: ~2× KV cache density with negligible quality impact on most models.
`--compilation-config`	`{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}`	vLLM compilation passes (op fusion, dead-code elimination).
`--async-scheduling`	(no value)	Overlap scheduling with GPU execution to hide scheduler latency.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
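The "~2× KV cache density" figure for fp8 follows from simple arithmetic on bytes per cached element. The sketch below works through it, assuming a 16-bit baseline and ignoring quantization scale overhead and any sliding-window layers. The layer, head, and dimension values are illustrative placeholders, not numbers taken from this page.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Illustrative model shape (an assumption made for this example only).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 64

bf16_bytes = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=2)
fp8_bytes = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=1)

print(bf16_bytes)              # 73728 bytes per token at 16-bit precision
print(fp8_bytes)               # 36864 bytes per token at fp8
print(bf16_bytes / fp8_bytes)  # 2.0: twice as many tokens fit in the same memory
```

The ratio does not depend on the model shape, only on the element width, which is why the same estimate applies across models.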

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model gpt-oss-120b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678

Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Now call your deployment to run inference:

Python
cURL

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
