Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 is a 235B-parameter MoE model (22B active per token) with up to 256K context. This preset serves Qwen3-235B FP8 on H100:8 with TensorRT-LLM, optimized for low time-to-first-token on single-request reasoning at this scale.
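
To see why this model is paired with eight GPUs, a rough estimate of weight memory helps. The sketch below is back-of-envelope arithmetic only. It assumes 1 byte per FP8 parameter, 80 GB per H100, and even sharding across GPUs, and it ignores quantization scales, activations, and the KV cache. None of these assumptions come from the preset itself.

```python
# Back-of-envelope weight memory for an FP8 checkpoint on H100:8.
# Assumptions (not from the preset): 1 byte per FP8 parameter, 80 GB per H100,
# weights sharded evenly; scales, activations, and KV cache are ignored.
TOTAL_PARAMS = 235e9   # total parameters across all experts
ACTIVE_PARAMS = 22e9   # parameters active per token
BYTES_PER_PARAM = 1    # FP8
NUM_GPUS = 8
GPU_MEMORY_GB = 80     # assumed H100 capacity

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = weights_gb / NUM_GPUS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights: ~{weights_gb:.0f} GB total, ~{per_gpu_gb:.1f} GB per GPU")
print(f"headroom per GPU: ~{GPU_MEMORY_GB - per_gpu_gb:.1f} GB")
print(f"active per token: {active_fraction:.1%} of parameters")
```

Under these assumptions the weights take well under half of each GPU, which leaves room for the KV cache that a 256K context needs.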

Hardware: H100 × 8
Engine: TRT-LLM v2
Context: 256K
Concurrency: 256

Write the config

Create and move into the project directory:
mkdir qwen3-235b-latency && cd qwen3-235b-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: false
    model: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
  repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
model_name: "model:qwen3-235b preset:latency"
weights:
  - source: "hf://Qwen/Qwen3-235B-A22B-Instruct-2507-FP8@main"
    mount_location: "/app/model_cache/trt_model"
resources:
  accelerator: H100:8
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 256
    max_num_tokens: 8192
    max_seq_len: 262144
    served_model_name: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
    tensor_parallel_size: 8
    patch_kwargs:
      disable_overlap_scheduler: true
      model_path: /app/model_cache/trt_model
      moe_expert_parallel_size: 8
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 256
      enable_autotune: false
      guided_decoding_backend: "xgrammar"
      enable_iter_perf_stats: 0
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
  version_overrides:
    v2_llm_version: null
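
Several values in this config appear more than once or depend on each other, such as the GPU count in accelerator and tensor_parallel_size. The sketch below runs a few sanity checks over a dictionary that mirrors the relevant fields. The function check_config and the invariants it tests are hypothetical. They are plausible consistency rules, not validation that Baseten or TensorRT-LLM is documented to perform.

```python
# Hypothetical sanity checks; the invariants are assumptions, not documented rules.
config = {
    "resources": {"accelerator": "H100:8"},
    "trt_llm": {
        "runtime": {
            "max_batch_size": 256,
            "tensor_parallel_size": 8,
            "patch_kwargs": {
                "moe_expert_parallel_size": 8,
                "cuda_graph_config": {"max_batch_size": 256},
            },
        }
    },
}


def check_config(cfg: dict) -> list[str]:
    """Return human-readable warnings; an empty list means nothing looks off."""
    warnings = []
    # "H100:8" -> 8 GPUs
    gpu_count = int(cfg["resources"]["accelerator"].split(":")[1])
    runtime = cfg["trt_llm"]["runtime"]
    patch = runtime["patch_kwargs"]

    if runtime["tensor_parallel_size"] != gpu_count:
        warnings.append("tensor_parallel_size does not match the GPU count")
    if patch["moe_expert_parallel_size"] > gpu_count:
        warnings.append("moe_expert_parallel_size exceeds the GPU count")
    if patch["cuda_graph_config"]["max_batch_size"] > runtime["max_batch_size"]:
        warnings.append("CUDA graph batch size exceeds runtime max_batch_size")
    return warnings


print(check_config(config) or "no warnings")
```

To check a real file, the dictionary could be loaded from config.yaml with a YAML parser instead of being written inline.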

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
Parameter               Value
Tensor parallel size    8
Max sequence length     262144
Max batch size          256
Max batched tokens      8192
Chunked prefill         enabled
Inference stack         v2
Served model name       Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
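
To illustrate how chunked prefill and the batched-token limit interact, the sketch below estimates how many prefill iterations a prompt needs when it is split into chunks of at most 8192 tokens. This is a simplified model: it assumes the prompt has the whole token budget to itself, whereas a real scheduler shares that budget with other in-flight requests. The function prefill_steps is a hypothetical helper, not part of any Baseten or TensorRT-LLM API.

```python
import math

MAX_NUM_TOKENS = 8192   # max_num_tokens from the config
MAX_SEQ_LEN = 262144    # max_seq_len from the config


def prefill_steps(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on prefill iterations for one prompt processed in chunks."""
    return math.ceil(prompt_tokens / max_num_tokens)


for n in (1_000, 8_192, 32_000, MAX_SEQ_LEN):
    print(f"{n:>7} prompt tokens -> at least {prefill_steps(n)} prefill step(s)")
```

Under this model, a prompt that fills the full context window is processed in 32 chunks rather than in one very large batch.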

Deploy

Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model qwen3-235b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
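
If you script your deployments, the model ID can be pulled out of the logs URL rather than copied by hand. The sketch below assumes the URL has the shape shown in the example output. The helper model_id_from_logs_url is hypothetical and not part of Truss.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if "models" not in parts:
        raise ValueError(f"no /models/ segment in {url!r}")
    index = parts.index("models") + 1
    if index >= len(parts) or not parts[index]:
        raise ValueError(f"no model ID after /models/ in {url!r}")
    return parts[index]


print(model_id_from_logs_url(
    "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
))
```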

Call the model

Your deployment serves an OpenAI-compatible API. Set the BASETEN_API_KEY environment variable, replace {model_id} in base_url with your model ID, then run the following script:
main.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
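
For a latency-focused deployment, streaming the reply shows tokens as they are generated. The sketch below is one way to do that with the OpenAI SDK's stream=True option. It assumes the deployment accepts streaming requests, which the stream field in example_model_input suggests but this page does not confirm. The helpers base_url_for and stream_reply are hypothetical names. The URL pattern is copied from the example above.

```python
import os


def base_url_for(model_id: str) -> str:
    """Build the base URL using the pattern from the example above."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def stream_reply(model_id: str, prompt: str) -> str:
    """Print the reply as it arrives and return the full text."""
    # Imported here so base_url_for works even without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=base_url_for(model_id),
    )
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # assumption: the deployment supports streaming
    )
    pieces = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)
```

Calling stream_reply("abcd1234", "What is machine learning?") with your own model ID in place of abcd1234 prints the answer incrementally and returns it as a single string.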