Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Pick the model you want to deploy. Each tab is a self-contained recipe.
openai/gpt-oss-20b is a 20B-parameter mixture-of-experts (MoE) model with up to 128K context. This preset serves GPT-OSS 20B on a single H100 using the Harmony response format, tuned for low time-to-first-token.

Hardware: H100
Engine: TRT-LLM v2
Context: 128K
Concurrency: 64

Write the config

Create and move into the project directory:
mkdir gpt-oss-20b-latency && cd gpt-oss-20b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:gpt-oss-20b preset:latency"
build_commands:
  - python -c 'from openai_harmony import load_harmony_encoding; load_harmony_encoding("HarmonyGptOss")'
model_metadata:
  repo_id: openai/gpt-oss-20b
  example_model_input:
    {
      "model": "openai/gpt-oss-20b",
      "messages":
        [
          {
            "role": "user",
            "content": "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:",
          },
        ],
      "stream": true,
      "max_tokens": 4096,
      "temperature": 0.5,
    }
  tags:
    - openai-compatible
resources:
  accelerator: H100
  cpu: "1"
  memory: 10Gi
  use_gpu: true
weights:
  - source: "hf://openai/gpt-oss-20b@main"
    mount_location: "/app/model_cache/trt_model"
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 64
    max_num_tokens: 8192
    max_seq_len: 131072
    patch_kwargs:
      model_path: /app/model_cache/trt_model
      chat_processor: harmony
      moe_expert_parallel_size: 1
      backend: pytorch
      cuda_graph_config:
        enable_padding: true
      disable_overlap_scheduler: 1
      enable_autotuner: 0
      enable_iter_perf_stats: 0
      enable_trtllm_sampler: 1
      guided_decoding_backend: xgrammar
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
        event_buffer_max_size: 1024
      max_beam_width: 1
      max_input_len: 131072
      model_level_stop_words:
        - "<|call|>"
      tokenizer_limit_length: 131072
      trust_remote_code: 1
      moe_config:
        backend: CUTLASS
    served_model_name: openai/gpt-oss-20b
    tensor_parallel_size: 1
  version_overrides:
    v2_llm_version: null

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter             Value
Tensor parallel size  1
Max sequence length   131072
Max batch size        64
Max batched tokens    8192
Chunked prefill       enabled
Inference stack       v2
Served model name     openai/gpt-oss-20b
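
To see how these values interact, here is a rough back-of-the-envelope sketch. It assumes that with chunked prefill enabled, each forward pass processes at most max_num_tokens prompt tokens, so a long prompt is split across several passes. The function name min_prefill_passes is illustrative and not part of any Baseten or TRT-LLM API, and the result is only a lower bound, since concurrent requests share the same token budget.

```python
import math

# Runtime values copied from the trt_llm.runtime block of config.yaml.
runtime = {
    "max_batch_size": 64,
    "max_num_tokens": 8192,
    "max_seq_len": 131072,
    "enable_chunked_prefill": True,
}


def min_prefill_passes(prompt_tokens: int, cfg: dict) -> int:
    """Lower bound on forward passes needed to prefill a single prompt."""
    if prompt_tokens > cfg["max_seq_len"]:
        raise ValueError("prompt exceeds max_seq_len")
    if not cfg["enable_chunked_prefill"] and prompt_tokens > cfg["max_num_tokens"]:
        # Assumption: without chunking, the whole prompt must fit in one pass.
        raise ValueError("prompt does not fit in one pass without chunked prefill")
    return max(1, math.ceil(prompt_tokens / cfg["max_num_tokens"]))


print(min_prefill_passes(4_000, runtime))    # 1
print(min_prefill_passes(131_072, runtime))  # 16
```

Under these assumptions, a prompt that fills the full 128K context needs at least 16 prefill passes, while a short prompt fits in one.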

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model gpt-oss-20b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
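
If you script your deployments, the model ID can be pulled out of the logs URL programmatically. The following is a minimal sketch; the helper names model_id_from_logs_url and base_url_for are hypothetical, and the URL shapes are assumed from the example output above and the base URL used in the next section.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = [p for p in urlparse(logs_url).path.split("/") if p]
    # Assumed shape, taken from the example output: /models/<model_id>/logs/<...>
    if len(parts) < 2 or parts[0] != "models":
        raise ValueError(f"unexpected logs URL: {logs_url!r}")
    return parts[1]


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))
```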

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    # {model_id} is a placeholder: substitute your own model ID before running.
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
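
The example input in config.yaml sets "stream": true, so a streaming call is a natural variation. The sketch below uses the OpenAI SDK's stream=True option and prints text as it arrives. It assumes the deployment emits chat completion chunks in the standard OpenAI format; text_deltas and stream_reply are illustrative helper names, not part of the SDK.

```python
import os


def text_deltas(chunks):
    """Yield the non-empty text fragments from a stream of chat completion chunks."""
    for chunk in chunks:
        # Some chunks carry no choices or no text (for example, role-only chunks).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def stream_reply(prompt: str, model_id: str) -> str:
    """Stream a chat completion, echo text as it arrives, and return the full reply."""
    from openai import OpenAI  # imported lazily so text_deltas works without the SDK

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    pieces = []
    for piece in text_deltas(stream):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)
```

Call it as stream_reply("What is machine learning?", model_id="abcd1234"), substituting your own model ID for the example value.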