Qwen3

Reasoning · Tool calling · Long context

Qwen3 is a sparse mixture-of-experts (MoE) model with 235B total parameters, of which 22B are active per token. This page deploys the FP8-quantized checkpoint, aimed at production-scale reasoning and agentic workflows.

Setup

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

This preset serves the Qwen3-235B FP8 checkpoint on eight H100 GPUs (H100:8) with TensorRT-LLM. It is tuned for low time-to-first-token on single-request reasoning.

Setting	Value
Hardware	H100 × 8
Engine	TRT-LLM v2
Context	256K
Concurrency	256
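To see why this preset asks for eight GPUs, a back-of-the-envelope memory budget helps. The sketch below is only an estimate under stated assumptions: one byte per parameter for FP8 (some tensors may be kept in higher precision), the 80 GB H100 variant, and no allowance for activations, CUDA graphs, or runtime overhead. The 0.8 fraction mirrors kv_cache_config.free_gpu_memory_fraction in the config below.

```python
# Rough memory budget for the preset; every figure here is an estimate.
TOTAL_PARAMS = 235e9         # total parameters, from the model description
BYTES_PER_PARAM_FP8 = 1      # assumption: all weights stored in 8 bits
GPU_COUNT = 8                # H100:8
GPU_MEMORY_GB = 80           # assumption: 80 GB H100 variant
KV_FREE_FRACTION = 0.8       # kv_cache_config.free_gpu_memory_fraction

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9
total_gb = GPU_COUNT * GPU_MEMORY_GB
free_after_weights_gb = total_gb - weights_gb
kv_cache_gb = free_after_weights_gb * KV_FREE_FRACTION

print(f"weights:            ~{weights_gb:.0f} GB")
print(f"total GPU memory:    {total_gb} GB")
print(f"free after weights: ~{free_after_weights_gb:.0f} GB")
print(f"KV cache budget:    ~{kv_cache_gb:.0f} GB (upper bound)")
```

Under these assumptions the weights alone would not fit on fewer than three 80 GB GPUs, and the remaining memory is what makes a 256K context with many concurrent requests plausible.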

Write the config

Create and move into the project directory:

mkdir qwen3-235b-latency && cd qwen3-235b-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: false
    model: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
  repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
model_name: "model:qwen3-235b preset:latency"
weights:
  - source: "hf://Qwen/Qwen3-235B-A22B-Instruct-2507-FP8@main"
    mount_location: "/app/model_cache/trt_model"
resources:
  accelerator: H100:8
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 256
    max_num_tokens: 8192
    max_seq_len: 262144
    served_model_name: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
    tensor_parallel_size: 8
    patch_kwargs:
      disable_overlap_scheduler: True
      model_path: /app/model_cache/trt_model
      moe_expert_parallel_size: 8
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 256
      enable_autotune: false
      guided_decoding_backend: "xgrammar"
      enable_iter_perf_stats: 0
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
  version_overrides:
    v2_llm_version: null

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Tensor parallel size	`8`
Max sequence length	`262144`
Max batch size	`256`
Max batched tokens	`8192`
Chunked prefill	`enabled`
Inference stack	`v2`
Served model name	`Qwen/Qwen3-235B-A22B-Instruct-2507-FP8`
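The interaction between max sequence length, max batched tokens, and chunked prefill can be made concrete with a little arithmetic. The sketch below is a simplification: it assumes a single request gets the whole 8192-token budget on every iteration, whereas with other requests in the batch the prompt would take more iterations. prefill_chunks is a hypothetical helper for illustration, not part of any library.

```python
import math

MAX_SEQ_LEN = 262144     # max_seq_len: 256 * 1024 tokens, the "256K" context
MAX_NUM_TOKENS = 8192    # max_num_tokens: token budget per engine iteration


def prefill_chunks(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Number of prefill iterations needed when a prompt is split into chunks."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return math.ceil(prompt_tokens / max_num_tokens)


for tokens in (1_000, 8_192, 8_193, MAX_SEQ_LEN):
    print(f"{tokens:>7} prompt tokens -> {prefill_chunks(tokens)} chunk(s)")
```

Without chunked prefill, a prompt longer than the per-iteration token budget could not be scheduled in one pass, which is why the two settings are usually read together.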

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model qwen3-235b-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). The examples below use it wherever you see {model_id}, and read your API key from the BASETEN_API_KEY environment variable.
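If you call the deployment from several scripts, it can help to build the base URL in one place rather than pasting the model ID by hand. This is a minimal sketch: baseten_base_url is a hypothetical helper, and the URL pattern is copied from the examples on this page (production environment, sync route).

```python
def baseten_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for a deployed model's production environment."""
    # Guard against an empty ID or a placeholder that was never filled in.
    if not model_id or "{" in model_id or "}" in model_id:
        raise ValueError("model_id must be the ID printed by `truss push`")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# abc1d2ef is the example ID from the sample output above, not a real deployment.
example_url = baseten_base_url("abc1d2ef")
print(example_url)
```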

Call the model

Your deployment serves an OpenAI-compatible API, so you can call it with the OpenAI SDK or with plain HTTP:

Python

main.py

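import os
from openai import OpenAI

# Replace with the model ID printed by `truss push` (abc1d2ef in the example above).
model_id = "abc1d2ef"

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],  # read the API key from the environment
    base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)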

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
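For long reasoning outputs you may prefer to print tokens as they arrive instead of waiting for the full response. The sketch below assumes the deployment honors stream=True the same way the OpenAI Chat Completions API does; that is an assumption, since this page only shows stream: false in the sample input. iter_text and stream_answer are hypothetical helper names, and the fake chunks at the end exist only to exercise the parsing logic offline.

```python
import os
from types import SimpleNamespace


def iter_text(chunks):
    """Yield the text deltas from an iterable of chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:          # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                      # skip None or empty deltas
            yield delta


def stream_answer(model_id: str, prompt: str) -> None:
    """Stream a reply from the deployment and print it as it arrives."""
    from openai import OpenAI          # imported here so the parser above works standalone

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for text in iter_text(stream):
        print(text, end="", flush=True)
    print()


# Offline check with stand-in chunk objects (no network call is made here).
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hel"))]),
    SimpleNamespace(choices=[]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="lo"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
]
demo_text = "".join(iter_text(fake_chunks))
```

To try it against a live deployment, call stream_answer with your own model ID, for example stream_answer("abc1d2ef", "What is machine learning?").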

Next steps

Call your model

Autoscaling