Deploy Qwen 3 30B MoE

Qwen 3 30B MoE is a next-generation Mixture-of-Experts model designed for high-throughput, low-latency inference. By activating only a subset of its 30B parameters for each token, it delivers the performance of a much larger model with the speed and efficiency of a smaller one. On Baseten, it is powered by the BIS-LLM engine, optimized for MoE architectures.

Deploy Qwen 3 30B MoE

Deploy Qwen 3 30B MoE using the BIS-LLM (V2) inference stack. The model is best served on B200 or H100 GPUs; a single H100 offers an excellent balance of cost and performance, and two L40S GPUs are a lower-cost alternative.
model_name: qwen-3-30b-moe
resources:
  accelerator: H100
  cpu: '1'
  memory: 24Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      repo: Qwen/Qwen3-30B-A3B-Instruct-2507
      source: HF
    num_builder_gpus: 2
    quantization_type: fp8_kv
    max_seq_len: 40960
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
    request_default_max_tokens: 40960
Deploy the model with the Truss CLI:
truss push

Run inference

The BIS-LLM engine provides an OpenAI-compatible API, making it easy to swap into existing LLM applications.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="qwen-3-30b-moe",
    messages=[
        {"role": "user", "content": "What are the benefits of Mixture-of-Experts architectures?"}
    ],
    max_tokens=512
)

print(response.choices[0].message.content)
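Note that the `max_seq_len` of 40960 in the config above bounds prompt and completion tokens together, so a long prompt shrinks the completion budget available to `max_tokens`. A quick pre-request check can be sketched as follows; `clamp_max_tokens` is a hypothetical helper, not part of the OpenAI client or the BIS-LLM API.

```python
# Hypothetical helper: keep prompt + requested completion within the
# engine's max_seq_len (40960 in the config above).
MAX_SEQ_LEN = 40960

def clamp_max_tokens(prompt_tokens: int, requested: int,
                     max_seq_len: int = MAX_SEQ_LEN) -> int:
    """Return the largest completion budget that fits within max_seq_len."""
    available = max_seq_len - prompt_tokens
    if available <= 0:
        raise ValueError("prompt already exceeds max_seq_len")
    return min(requested, available)

print(clamp_max_tokens(1000, 512))    # plenty of room: returns 512
print(clamp_max_tokens(40700, 512))   # near the limit: returns 260
```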

Configuration and tuning

Qwen 3 30B MoE excels in scenarios requiring high throughput and complex reasoning. Its MoE architecture is specifically optimized within BIS-LLM for maximum efficiency.

Hardware and quantization

We recommend using fp8_kv quantization. This optimizes both the model weights and the KV cache, allowing for longer context windows and larger batch sizes within the same memory footprint. On B200 hardware, you can leverage FP4 quantization for even greater performance gains.
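On B200, the switch from the H100 config above is a matter of a few fields in the same Truss config. A sketch of the deltas is below; note the exact `quantization_type` string for FP4 is an assumption here and may differ by engine version, so check the current BIS-LLM options before deploying.

```yaml
# Hypothetical B200 variant of the build section above.
# The fp4 quantization_type value is an assumption and may vary by version.
resources:
  accelerator: B200
trt_llm:
  inference_stack: v2
  build:
    quantization_type: fp4
    max_seq_len: 40960
```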

MoE-specific optimizations

BIS-LLM includes custom kernels designed specifically for MoE routing and expert computation. These optimizations reduce the overhead of expert routing and dispatch, ensuring that you get the full latency benefits of the architecture without the typical “MoE tax.”
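To illustrate what that routing step does (this is a plain-Python sketch of generic top-k MoE gating, not the BIS-LLM kernels themselves): each token's router logits are softmaxed, the top-k experts are selected, and their outputs are combined using the renormalized gate weights, so compute scales with k rather than with the total expert count.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights.

    Returns (expert_index, weight) pairs whose weights sum to 1. Only k
    experts are activated per token, which is why an MoE model can have
    many more total parameters than it uses on any single forward pass.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 8 experts; only 2 experts fire.
choices = route_top_k([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(choices)
```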