> ## Documentation Index
> Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Qwen3

> Sparse MoE model with 235B total parameters (22B active per token). FP8-quantized checkpoint for production-scale reasoning and agentic workflows.

Reasoning · Tool calling · Long context

## Setup

To get started, sign in to Baseten with Truss and then install the OpenAI SDK.

**Sign in to Baseten**

```sh theme={"system"}
uvx truss login --browser
```

**Install the OpenAI SDK**

```sh theme={"system"}
uv pip install openai
```

[Qwen/Qwen3-235B-A22B-Instruct-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8) is a 235B-parameter MoE model (22B active per token) with up to 256K context. This preset serves Qwen3-235B FP8 on H100:8 with TensorRT-LLM, optimized for low time-to-first-token on single-request reasoning at this scale.

| Hardware | Engine     | Context | Max batch size |
| -------- | ---------- | ------- | -------------- |
| H100 × 8 | TRT-LLM v2 | 256K    | 256            |

## Write the config

Create and move into the project directory:

```sh theme={"system"}
mkdir qwen3-235b-latency && cd qwen3-235b-latency
```

Then create a file named `config.yaml` and paste the following:

```yaml config.yaml theme={"system"}
model_metadata:
  example_model_input: # Loads sample request into Baseten playground
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What does Tongyi Qianwen mean?"
    stream: false
    model: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
    max_tokens: 512
    temperature: 0.6
  tags:
    - openai-compatible
  repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
model_name: "model:qwen3-235b preset:latency"
weights:
  - source: "hf://Qwen/Qwen3-235B-A22B-Instruct-2507-FP8@main"
    mount_location: "/app/model_cache/trt_model"
resources:
  accelerator: H100:8
  cpu: "1"
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 256
    max_num_tokens: 8192
    max_seq_len: 262144
    served_model_name: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
    tensor_parallel_size: 8
    patch_kwargs:
      disable_overlap_scheduler: True
      model_path: /app/model_cache/trt_model
      moe_expert_parallel_size: 8
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 256
      enable_autotune: false
      guided_decoding_backend: "xgrammar"
      enable_iter_perf_stats: 0
      kv_cache_config:
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.8
  version_overrides:
    v2_llm_version: null
```

## Key parameters

[Baseten Inference Stack](/engines/bis-llm/overview) (BIS) reads these fields from the `trt_llm` block.
Each one shapes how the engine is built and served:

| Parameter            | Value                                    |
| -------------------- | ---------------------------------------- |
| Tensor parallel size | `8`                                      |
| Max sequence length  | `262144`                                 |
| Max batch size       | `256`                                    |
| Max batched tokens   | `8192`                                   |
| Chunked prefill      | `enabled`                                |
| Inference stack      | `v2`                                     |
| Served model name    | `Qwen/Qwen3-235B-A22B-Instruct-2507-FP8` |

## Deploy

Push the config to Baseten:

```sh theme={"system"}
uvx truss push
```

You should see output similar to:

```output theme={"system"}
✨ Model qwen3-235b-latency was successfully pushed ✨
Model ID: abc1d2ef
Deployment ID: xyz123
Endpoint: model-abc1d2ef.api.baseten.co
Logs: https://app.baseten.co/models/abc1d2ef/logs/xyz123
```

Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.

## Call the model

Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

Now call your deployment to run inference:

```python main.py theme={"system"}
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    # Replace {model_id} with the model ID from the `truss push` output.
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
```

```sh theme={"system"}
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
```
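
Both examples above leave `{model_id}` as a literal placeholder in the URL, so a request fails until you substitute the real ID. As a minimal sketch (not an official Baseten helper), the substitution and the request body can be wrapped in small functions. The names `build_base_url`, `build_request`, and `ask` are hypothetical, and the system prompt and sampling settings are copied from the sample request in `config.yaml` on the assumption that you want the same defaults:

```python
import os

# served_model_name from config.yaml
MODEL_NAME = "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"


def build_base_url(model_id: str) -> str:
    """Fill the model ID into the endpoint pattern used in the examples above."""
    if not model_id or "{" in model_id:
        # Catches an empty ID or an unreplaced "{model_id}" placeholder.
        raise ValueError("pass the real model ID from the `truss push` output")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def build_request(question: str) -> dict:
    """Build a chat request that mirrors the sample input in config.yaml."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "max_tokens": 512,    # same as example_model_input
        "temperature": 0.6,   # same as example_model_input
    }


def ask(model_id: str, question: str) -> str:
    """Send one chat completion request and return the reply text."""
    # Imported lazily so the two helpers above work without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url=build_base_url(model_id),
    )
    response = client.chat.completions.create(**build_request(question))
    return response.choices[0].message.content


# Example call, using the model ID from the sample `truss push` output:
# print(ask("abc1d2ef", "What is machine learning?"))
```

Keeping URL construction separate from the network call lets you check the endpoint and payload locally before sending a request to the deployment.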