Setup
To get started, sign in to Baseten with Truss, then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 is a 235B-parameter MoE model (22B active per token) with up to 256K context.
This preset serves Qwen3-235B FP8 on H100:8 with TensorRT-LLM, optimized for low time-to-first-token on single-request reasoning at this scale.
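To see roughly why this model fits on eight GPUs, the sketch below estimates weight memory from the figures above. It assumes FP8 stores one byte per parameter, that weights are split evenly across the eight GPUs, and that each H100 has 80 GB of memory (the page does not state the variant). It ignores quantization scales, any layers kept in higher precision, activations, and the KV cache, so treat the result as a rough estimate only.

```python
# Back-of-envelope weight memory for this preset (illustrative arithmetic only).
TOTAL_PARAMS = 235e9      # 235B total parameters, from the model description
BYTES_PER_PARAM_FP8 = 1   # assumption: one byte per FP8 weight
NUM_GPUS = 8              # H100:8
GPU_MEMORY_GB = 80        # assumption: 80 GB H100 variant


def weight_memory_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM_FP8) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


total_weights_gb = weight_memory_gb(TOTAL_PARAMS)
per_gpu_weights_gb = total_weights_gb / NUM_GPUS          # assumes an even split
headroom_per_gpu_gb = GPU_MEMORY_GB - per_gpu_weights_gb  # left for KV cache, activations, etc.

print(f"total weights:   ~{total_weights_gb:.0f} GB")
print(f"weights per GPU: ~{per_gpu_weights_gb:.1f} GB")
print(f"headroom per GPU: ~{headroom_per_gpu_gb:.1f} GB")
```

Under these assumptions, most of each GPU's memory remains available for the KV cache, which matters for long-context requests.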
Write the config
Create and move into the project directory:
mkdir qwen3-235b-latency && cd qwen3-235b-latency
Then create a file named config.yaml and paste the following:
model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: false
    model: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
  repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
model_name: "model:qwen3-235b preset:latency"
weights:
  - source: "hf://Qwen/Qwen3-235B-A22B-Instruct-2507-FP8@main"
    mount_location: "/app/model_cache/trt_model"
resources:
  accelerator: H100:8
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 256
    max_num_tokens: 8192
    max_seq_len: 262144
    served_model_name: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
    tensor_parallel_size: 8
    patch_kwargs:
      disable_overlap_scheduler: True
      model_path: /app/model_cache/trt_model
      moe_expert_parallel_size: 8
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 256
      enable_autotune: false
      guided_decoding_backend: "xgrammar"
      enable_iter_perf_stats: 0
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
  version_overrides:
    v2_llm_version: null
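Before pushing, it can help to check that the parallelism settings agree with the hardware request. The sketch below is a hypothetical pre-flight check, not part of Truss. It copies a few values from the config into a plain dictionary, assumes `tensor_parallel_size`, `max_num_tokens`, and `max_seq_len` sit under `trt_llm.runtime`, and assumes an accelerator string without a `:N` suffix means a single GPU. The rule that tensor parallel size should equal the GPU count is a reasonable expectation for this preset rather than something this page states.

```python
# Hypothetical pre-flight check for a few fields of config.yaml.
# The helper names and the rules checked here are illustrative assumptions.

def gpu_count(accelerator: str) -> int:
    """Parse the GPU count from an accelerator string such as 'H100:8'."""
    _, _, count = accelerator.partition(":")
    return int(count) if count else 1  # assumption: no suffix means one GPU


def check_parallelism(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    gpus = gpu_count(config["resources"]["accelerator"])
    runtime = config["trt_llm"]["runtime"]
    if runtime["tensor_parallel_size"] != gpus:
        problems.append(
            f"tensor_parallel_size={runtime['tensor_parallel_size']} "
            f"does not match {gpus} GPU(s)"
        )
    if runtime["max_num_tokens"] > runtime["max_seq_len"]:
        problems.append("max_num_tokens exceeds max_seq_len")
    return problems


# Values copied from the config above.
config = {
    "resources": {"accelerator": "H100:8"},
    "trt_llm": {
        "runtime": {
            "tensor_parallel_size": 8,
            "max_num_tokens": 8192,
            "max_seq_len": 262144,
        }
    },
}

print(check_parallelism(config) or "no problems found")
```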
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
| Parameter | Value |
|---|---|
| Tensor parallel size | 8 |
| Max sequence length | 262144 |
| Max batch size | 256 |
| Max batched tokens | 8192 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 |
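The interaction between chunked prefill, max batched tokens, and max sequence length can be illustrated with a little arithmetic. The sketch below assumes a single request has the full 8192-token budget to itself in every iteration. In practice the budget is shared with other in-flight requests, so the real number of prefill steps may be higher. The function name is hypothetical.

```python
import math

MAX_NUM_TOKENS = 8192   # max batched tokens per iteration, from the table
MAX_SEQ_LEN = 262144    # max sequence length, from the table


def prefill_chunks(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Estimate how many prefill iterations a prompt needs with chunked prefill.

    Assumes the request has the whole token budget to itself in each iteration.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return math.ceil(prompt_tokens / max_num_tokens)


print(prefill_chunks(10_000))       # a 10,000-token prompt needs 2 chunks
print(prefill_chunks(MAX_SEQ_LEN))  # a maximum-length prompt needs 32 chunks
```

Without chunked prefill, a prompt longer than the per-iteration token budget could not be processed in one pass, which is why the two settings appear together in this preset.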
Deploy
Push the config to Baseten from the project directory:
uvx truss push
You should see output similar to:
✨ Model qwen3-235b-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
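If you are scripting the deployment, the model ID can be pulled out of the logs URL programmatically. The helper below is a hypothetical convenience, not part of Truss or the Baseten SDK. It uses the example URL from the output above and the endpoint pattern used later on this page.

```python
import re


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a Baseten logs URL."""
    match = re.search(r"/models/([^/]+)", url)
    if match is None:
        raise ValueError(f"no /models/<id> segment found in {url!r}")
    return match.group(1)


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)

# Fill the {model_id} placeholder in the endpoint pattern used on this page.
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

print(model_id)  # abcd1234
print(base_url)
```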
Call the model
Your deployment serves an OpenAI-compatible API. In the examples that follow, replace {model_id} with your model ID and make sure the BASETEN_API_KEY environment variable is set.
Call your deployment with the OpenAI Python SDK or with curl:
import os

from openai import OpenAI

# Replace {model_id} in base_url with your model ID before running.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

# The model name matches served_model_name in config.yaml.
response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
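If you would rather not install the OpenAI SDK, the same request can be built with the Python standard library. The sketch below mirrors the curl call above: same URL pattern, same `Api-Key` authorization header, same JSON body. The function names are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape. The script only builds and prints the request; call `send_chat_request` yourself once your key and model ID are in place.

```python
import json
import urllib.request

MODEL_NAME = "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"


def build_chat_request(model_id: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example on this page."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Api-Key {api_key}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text (makes a network call)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder values for illustration; nothing is sent here.
request = build_chat_request("abcd1234", "YOUR_API_KEY", "What is machine learning?")
print(request.get_method(), request.full_url)
```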