Overview and use cases
BIS-LLM is designed for MoE models and scenarios requiring the most advanced inference optimizations. Ideal for:
- MoE model families:
  - DeepSeek: deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.2
  - Qwen MoE: Qwen/Qwen3-30B-A3B, Qwen/Qwen3-Coder-480B-A35B-Instruct
  - Kimi: moonshotai/Kimi-K2-Instruct
  - GLM: zai-org/GLM-4.7
  - Llama 4: meta-llama/llama-4-maverick
  - GPT-OSS: Various open-source GPT variants
- High-performance inference: FP4 quantization on GB200/B200 GPUs
- Complex reasoning: Advanced tool calling and structured outputs
- Large-scale deployments: Multi-node setups and distributed inference
Forward Deployed Engineer Gated Features
Some of our more advanced features are gated behind feature flags that we toggle internally. They are not the easiest to use, and some are mutually exclusive, which makes them hard to maintain on our docs page. These features power some of the largest LLM deployments for the customer logos on our website and a couple of world records on GPUs. For detailed information on each advanced feature, see Gated Features for BIS-LLM.
Architecture support
MoE model support
BIS-LLM specifically optimizes for Mixture of Experts architectures. Primary MoE architectures:
- DeepseekV32ForCausalLM - DeepSeek family
- Qwen3MoEForCausalLM - Qwen3 MoE family
- KimiK2ForCausalLM - Kimi K2 family
- Glm4MoeForCausalLM - GLM MoE variants
- GPTOSS - OpenAI GPT-OSS variants
- …
Dense model support
While optimized for MoE, BIS-LLM also supports dense models with advanced features. Benefits for dense models:
- GB200/B200 optimization: Advanced GPU kernel optimization
- FP4 quantization: Next-generation quantization support
- Enhanced memory management: Improved KV cache handling
Consider BIS-LLM when:
- Models >30B parameters requiring maximum performance
- Deployments on GB200/B200 GPUs with advanced quantization
- You tried out V1 and want to compare against V2
- You want to try V2 features like KV routing or Disaggregated Serving
- Speculative decoding on GB200/B200
Advanced quantization
BIS-LLM supports next-generation quantization formats for maximum performance. Quantization options (see the configuration sketch after this list):
- no_quant: FP16/BF16 precision; automatically uses hf_quant_config.json from modelopt if available
- fp8: FP8 weights + 16-bit KV cache
- fp4: FP4 weights + 16-bit KV cache
- fp8_kv: FP8 weights + 8-bit symmetric KV cache
- fp4_kv: FP4 weights + 8-bit symmetric KV cache
- fp4_mlp_only: FP4 weights (MLP layers only) + 16-bit KV cache and attention computation
- FP4 kernels: Custom B200 kernels for maximum performance
- Memory efficiency: 75% memory reduction with FP4; for some models, such as DeepSeek-V3, FP4 is strongly preferred on B200 due to kernel selection
- Speed improvement: 4x-8x faster inference with minimal accuracy loss
- Cascading improvements: More memory headroom and faster inference lead to improved system performance, especially under high load
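As a hedged illustration only, the sketch below shows how a quantization mode might be validated and attached to a deployment config. The mode names and their weight/KV-cache meanings come from the list above; every config field name (model_repo, inference_stack, gpu, tensor_parallel_size) is an assumption, not the real BIS-LLM schema.

```python
# Hypothetical sketch: selecting a quantization mode for a BIS-LLM deployment.
# Only the mode names and their meanings come from the docs above; every config
# field name here is an assumption, not the authoritative schema.

QUANT_MODES = {
    "no_quant": "FP16/BF16 weights, 16-bit KV cache (or hf_quant_config.json from modelopt)",
    "fp8": "FP8 weights, 16-bit KV cache",
    "fp4": "FP4 weights, 16-bit KV cache",
    "fp8_kv": "FP8 weights, 8-bit symmetric KV cache",
    "fp4_kv": "FP4 weights, 8-bit symmetric KV cache",
    "fp4_mlp_only": "FP4 weights in MLP layers only, 16-bit KV cache and attention",
}

def build_engine_config(model_repo: str, gpu: str, quantization: str = "no_quant") -> dict:
    """Return an illustrative deployment config dict (field names are hypothetical)."""
    if quantization not in QUANT_MODES:
        raise ValueError(f"unknown quantization mode: {quantization!r}")
    if quantization.startswith("fp4") and gpu not in {"B200", "GB200"}:
        # FP4 kernels target B200/GB200; FP8 is the usual choice on H100.
        raise ValueError("FP4 modes are intended for B200/GB200 GPUs")
    return {
        "model_repo": model_repo,
        "inference_stack": "v2",  # BIS-LLM is the V2 inference stack
        "quantization": quantization,
        "gpu": gpu,
    }

# Example: FP4 weights with an 8-bit KV cache for DeepSeek-V3.1 on B200.
config = build_engine_config("deepseek-ai/DeepSeek-V3.1", gpu="B200", quantization="fp4_kv")
```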
Structured outputs and tool calling
Advanced JSON schema validation and function calling capabilities. Features:
- JSON schema validation: Precise structured output generation
- Function calling: Advanced tool selection and execution
- Multi-tool support: Complex tool chains and reasoning
- Schema inheritance: Nested and complex schema support
Configuration examples
Note: The examples below are functional starting points; advanced features change frequently. Please reach out about how to best configure a specific or fine-tuned model; we are happy to help.
GPT-OSS 120B deployment
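A minimal sketch, assuming a dict-style config with illustrative field names (only inference_stack: v2 and the quantization mode names appear in these docs); the sizing follows the hardware table under Best practices.

```python
# Hypothetical GPT-OSS 120B config sketch; field names are illustrative assumptions.
# Sizing follows the hardware table below (100B+ MoE -> B200, FP4, TP 4-8).
gpt_oss_120b_config = {
    "model_repo": "openai/gpt-oss-120b",  # Hugging Face repo id for the 120B GPT-OSS variant
    "inference_stack": "v2",              # explicitly opt in to the BIS-LLM (V2) stack
    "quantization": "fp4",                # FP4 weights + 16-bit KV cache
    "gpu": "B200",
    "tensor_parallel_size": 4,
}
```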
Qwen3-30B-A3B-Instruct-2507 MoE with FP4 quantization
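Another hedged sketch under the same assumptions; only the model id and quantization mode names come from this page.

```python
# Hypothetical Qwen3 MoE config sketch with FP4 quantization; field names are illustrative.
qwen3_moe_fp4_config = {
    "model_repo": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "inference_stack": "v2",
    "quantization": "fp4",        # FP4 weights + 16-bit KV cache (B200/GB200 kernels)
    "gpu": "B200",
    "tensor_parallel_size": 2,    # <30B-class MoE: TP 2-4 per the hardware table
}
```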
Dense model with BIS-LLM V2
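A dense-model sketch under the same assumptions; the Llama 3.3 70B repo id is used purely as an example of a dense >30B model and is not taken from this page.

```python
# Hypothetical dense-model config sketch on the V2 stack; field names are illustrative.
dense_v2_config = {
    "model_repo": "meta-llama/Llama-3.3-70B-Instruct",  # example dense >30B model (assumption)
    "inference_stack": "v2",
    "quantization": "fp8",        # FP8 weights + 16-bit KV cache
    "gpu": "H100",
    "tensor_parallel_size": 2,    # dense >30B: H100:2-4, FP8, TP 2-4 per the hardware table
}
```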
Integration examples
OpenAI-compatible inference
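A minimal sketch using the official OpenAI Python client against a BIS-LLM deployment's OpenAI-compatible endpoint; the base_url, API key, and served model name are placeholders for your own deployment.

```python
# Call a BIS-LLM deployment through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://example.com/v1",   # placeholder: your deployment's endpoint
    api_key="YOUR_API_KEY",              # placeholder credential
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",   # one of the supported MoE models listed above
    messages=[{"role": "user", "content": "Summarize Mixture of Experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)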
Advanced structured outputs
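A sketch of JSON-schema-constrained generation, assuming the deployment honors the OpenAI-style response_format parameter; the endpoint and the schema itself are made-up examples.

```python
# Request structured output constrained by a JSON schema.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")  # placeholders

schema = {
    "type": "object",
    "properties": {
        "model_family": {"type": "string"},
        "is_moe": {"type": "boolean"},
        "active_params_billions": {"type": "number"},
    },
    "required": ["model_family", "is_moe", "active_params_billions"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Describe Qwen3-30B-A3B as structured data."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "model_card", "schema": schema, "strict": True},
    },
)
print(response.choices[0].message.content)  # JSON string matching the schema
```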
Multi-tool function calling
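A sketch of multi-tool function calling via the standard OpenAI tools/tool_choice parameters; the tool definitions and endpoint are hypothetical.

```python
# Offer the model multiple tools and inspect the tool calls it requests.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")  # placeholders

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_gpu_inventory",  # hypothetical tool
            "description": "Return available GPU types and counts for a region.",
            "parameters": {
                "type": "object",
                "properties": {"region": {"type": "string"}},
                "required": ["region"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "estimate_cost",  # hypothetical tool
            "description": "Estimate hourly cost for a GPU configuration.",
            "parameters": {
                "type": "object",
                "properties": {
                    "gpu": {"type": "string"},
                    "count": {"type": "integer"},
                },
                "required": ["gpu", "count"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "What would 4x B200 cost per hour in us-east?"}],
    tools=tools,
    tool_choice="auto",
)

# The model may request one or more tool calls; execute them and send results back.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```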
Best practices
Hardware selection
GPU recommendations:
- B200: Best for FP4 quantization and next-gen performance
- H100: Best for FP8 quantization and production workloads
- Multi-GPU: Required for large MoE models (>30B parameters)
- Multi-Node:
| Model Size | Recommended GPU | Quantization | Tensor Parallel |
|---|---|---|---|
| <30B MoE | H100:2-4 | FP8 | 2-4 |
| 30-100B MoE | H100:4-8 | FP8 | 4-8 |
| 100B+ MoE | B200:4-8 | FP4 | 4-8 |
| Dense >30B | H100:2-4 | FP8 | 2-4 |
Production best practices
V2 inference stack optimization
Configuration differences from V1
Migration guide
From Engine-Builder-LLM
V1 configuration:
Key differences (see the sketch after this list):
- inference_stack: Explicitly set to v2
- Build configuration: Simplified with fewer options
- Engine configuration: Enhanced with V2-specific features
- Performance: Better optimization for MoE models
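A hedged before/after sketch of the migration. Only inference_stack: v2 is documented above; every other field name is an assumption meant to show the shape of the change, not the real V1 or V2 schemas.

```python
# Illustrative V1 -> V2 migration sketch; field names other than inference_stack are assumptions.
v1_config = {
    "model_repo": "deepseek-ai/DeepSeek-V3.1",
    # V1 (Engine-Builder-LLM) exposes a larger, more detailed set of build options.
    "build": {"max_batch_size": 64, "quantization": "fp8"},
}

v2_config = {
    "model_repo": "deepseek-ai/DeepSeek-V3.1",
    "inference_stack": "v2",      # the key switch: explicitly opt in to BIS-LLM
    "quantization": "fp4",        # V2 quantization modes from the list above
    "gpu": "B200",
    "tensor_parallel_size": 8,    # 100B+ MoE sizing per the hardware table
}
```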
Further reading
- BIS-LLM reference config - Complete V2 configuration options
- Advanced features documentation - Enterprise features and capabilities
- Structured outputs - Advanced JSON schema validation
- Examples section - Concrete deployment examples