Gated features for BIS-LLM

BIS-LLM provides features for large-scale deployments: KV cache optimization, disaggregated serving, and specialized inference strategies.

These advanced features are not fully self-serviceable. Contact us to enable them for your organization.

Available advanced features

Routing and scaling

KV-aware routing and disaggregated serving optimize multi-replica deployments. KV-aware routing directs requests to replicas with the best cache hit potential, while disaggregated serving separates prefill and decode phases into independent clusters that scale separately. Separate prefill and decode autoscaling uses token-exact metrics to right-size each phase.

MoE optimization

WideEP (expert parallelism) distributes experts across multiple GPUs for extremely large expert counts. These features work together to maximize hardware utilization on models like DeepSeek-V3 and Qwen3MoE.

Attention and memory

DP attention for MLA (Multi-Head Latent Attention) compresses KV cache by projecting attention tensors into a compact latent space, DP attention helps to managed KV-Cache across GPU ranks, and tunes DeepSeek deployments for high throughput. DeepSparseAttention sparsifies the attention matrix based on token relevance. Distributed KV storage spreads KV cache across devices for long-context inference beyond single-device memory limits.

Speculative decoding

Speculative n-gram automata-based decoding uses automata to predict tokens from n-gram patterns without full model computation. Speculative MTP or Eagle3 decoding uses draft-model approaches to predict and verify multiple future tokens.

Kernel optimization

Zero-overlap scheduling overlaps computation and communication to hide latency. Auto-tuned kernels optimize kernel parameters for your specific hardware and model topology.

KV-aware routing

KV-aware routing directs requests to replicas with the best chance of KV cache hits, routing based on cache availability and replica utilization. KV-aware routing reduces inter-token latency by distributing load across replicas, improves time-to-first-token through cache hits on repeated queries, and increases global throughput through cache reuse.

Disaggregated serving

Disaggregated serving separates prefill and decode phases into independent clusters, allowing each to scale and be optimized independently. This architecture is particularly valuable for large MoE models. Disaggregated serving is available as a gated feature. Contact us to be paired with an engineer to discuss your needs. Disaggregated serving enables independent scaling of prefill and decode resources, isolates time-critical TTFT metrics from throughput-focused phases, and optimizes costs by right-sizing each phase for its workload.

Get started

Choose the right configuration

For advanced deployments with large MoE models and planet-scale inference, contact us. For standard deployments: Use the standard BIS-LLM configuration as documented in BIS-LLM configuration.

Model recommendations

Models that benefit from advanced features

Large MoE models:

DeepSeek-V3
Qwen3MoE
Kimi-K2
GLM-4.7
GPT-OSS

Ideal use cases:

High-throughput API services
Complex reasoning tasks
Long-context applications, including agentic coding
Planet-scale deployments

When to use standard BIS-LLM or Engine-Builder-LLM

Dense models under 70B parameters
Standard MoE models under 30B parameters
Development and testing environments
Workloads with low KV cache hit rates

BIS-LLM overview: Main engine documentation.
BIS-LLM reference config: Configuration options.
Structured outputs documentation: JSON schema validation.
Examples section: Deployment examples.

Get started

Concepts

Development

Deployment

Inference

Engines

Training

Organization

Observability

Troubleshooting

Gated features for BIS-LLM

Available advanced features

Routing and scaling

MoE optimization

Attention and memory

Speculative decoding

Kernel optimization

KV-aware routing

Disaggregated serving

Get started

Choose the right configuration

Model recommendations

Models that benefit from advanced features

When to use standard BIS-LLM or Engine-Builder-LLM

Get started

Concepts

Development

Deployment

Inference

Engines

Training

Organization

Observability

Troubleshooting

​Available advanced features

​Routing and scaling

​MoE optimization

​Attention and memory

​Speculative decoding

​Kernel optimization

​KV-aware routing

​Disaggregated serving

​Get started

​Choose the right configuration

​Model recommendations

​Models that benefit from advanced features

​When to use standard BIS-LLM or Engine-Builder-LLM

​Related

Available advanced features

Routing and scaling

MoE optimization

Attention and memory

Speculative decoding

Kernel optimization

KV-aware routing

Disaggregated serving

Get started

Choose the right configuration

Model recommendations

Models that benefit from advanced features

When to use standard BIS-LLM or Engine-Builder-LLM

Related