Deploy Qwen 3 30B MoE
Deploy Qwen 3 30B MoE using the BIS-LLM (V2) inference stack. This model is best served on B200 or H100 GPUs. For the 30B MoE variant, a single H100 or two L40S GPUs provide an excellent balance of cost and performance.

Run inference
The BIS-LLM engine provides an OpenAI-compatible API, making it easy to swap into existing LLM applications.
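As a minimal sketch of what a request might look like, the snippet below builds a call against an OpenAI-compatible chat completions endpoint using only the Python standard library. The base URL, model id, and API key are placeholders, not values from this document; substitute your deployment's actual endpoint and credentials.

```python
# Hypothetical request to a BIS-LLM deployment's OpenAI-compatible API.
# BASE_URL, API_KEY, and the model id are placeholders.
import json
import urllib.request

BASE_URL = "https://example.invalid/v1"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                 # placeholder credential

payload = {
    "model": "qwen3-30b-moe",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)

# Uncomment against a live endpoint:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official OpenAI Python SDK can also be pointed at the same endpoint by setting its `base_url` and `api_key` arguments.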
Configuration and tuning
Qwen 3 30B MoE excels in scenarios requiring high throughput and complex reasoning. Its MoE architecture is specifically optimized within BIS-LLM for maximum efficiency.

Hardware and quantization
We recommend using fp8_kv quantization. This quantizes both the model weights and the KV cache, allowing for longer context windows and larger batch sizes within the same memory footprint. On B200 hardware, you can leverage FP4 quantization for even greater performance gains.
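The exact configuration schema depends on your deployment tooling, but as a hypothetical sketch, the quantization choice described above might be expressed like this. Every key name here is illustrative, not the actual BIS-LLM schema; consult the BIS-LLM documentation for the real field names.

```python
# Hypothetical engine configuration enabling fp8_kv quantization.
# All keys and values are illustrative placeholders, not the
# actual BIS-LLM config schema.
engine_config = {
    "model": "qwen3-30b-moe",   # placeholder model id
    "quantization": "fp8_kv",   # quantize weights and KV cache to FP8
    "max_batch_size": 64,       # larger batches fit in the freed memory
    "max_seq_len": 32768,       # longer context within the same footprint
}

# On B200 hardware you might select FP4 instead:
# engine_config["quantization"] = "fp4"
```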
MoE-specific optimizations
BIS-LLM includes custom kernels designed specifically for MoE routing and expert computation. These optimizations reduce the overhead of MoE switching, ensuring that you get the full latency benefits of the architecture without the typical "MoE tax."

Related
- Model APIs — Instant access to Qwen models via shared endpoints.
- BIS-LLM documentation — Details on the engine powering this model.
- Truss examples — Source code for this Truss.