Engine ecosystem
BEI (Embeddings & Classification)
Embeddings, reranking, and classification models with up to 1400 embeddings/sec throughput.
Engine-Builder-LLM (Dense Models)
Dense text generation models with lookahead decoding, structured outputs, and single-node inference.
BIS-LLM (MoE & Advanced)
MoE models with KV-aware routing, tool calling, and speculative decoding.
Specialized Deployments
Specialized engines for models like Whisper, Orpheus, or Flux, available as dedicated deployments rather than self-serviceable options.
Engine selection
Select an engine based on your model’s architecture and expected workload.
| Model type | Architecture | Recommended engine | Key features | Hardware |
|---|---|---|---|---|
| Dense LLM | CausalLM (text generation) | Engine-Builder-LLM | Lookahead decoding, structured outputs | H100, B200 |
| MoE Models | Mixture of Experts | BIS-LLM | KV-aware routing, advanced quantization | H100, B200 |
| Large Models | 700B+ parameters | BIS-LLM | Distributed inference, FP4 support | H100, B200 |
| Embeddings | BERT-based (bidirectional) | BEI-Bert | Cold-start optimization, 16-bit precision | T4, L4, A10G, H100, B200 |
| Embeddings | Causal (Llama, Mistral, Qwen) | BEI | FP8 quantization, high throughput | L4, A10G, H100, B200 |
| Reranking | Cross-encoder architectures | BEI / BEI-Bert | Low latency, batch processing | L4, A10G, H100, B200 |
| Classification | Sequence classification | BEI / BEI-Bert | High throughput, cached weights | L4, A10G, H100, B200 |
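Read as a decision procedure, the table above boils down to a few checks. The helper below is a hypothetical sketch for illustration only; it is not part of any engine SDK.

```python
def recommend_engine(architecture: str, params_b: float = 7) -> str:
    """Map a model architecture (and size in billions of parameters) to the
    recommended engine from the selection table above (illustrative only)."""
    arch = architecture.lower()
    if any(k in arch for k in ("bert", "cross-encoder", "classification")):
        return "BEI-Bert"            # bidirectional encoders: fast cold starts, FP16/BF16
    if any(k in arch for k in ("embedding", "reranker")):
        return "BEI"                 # causal embedding/reranking models: FP8, high throughput
    if "moe" in arch or "mixture" in arch or params_b >= 700:
        return "BIS-LLM"             # MoE and 700B+ models: KV-aware routing, FP4
    return "Engine-Builder-LLM"      # dense CausalLM: lookahead decoding, structured outputs


print(recommend_engine("BERT-based bidirectional encoder"))   # -> BEI-Bert
print(recommend_engine("CausalLM (Llama 3.3)", params_b=70))  # -> Engine-Builder-LLM
print(recommend_engine("Mixture of Experts", params_b=671))   # -> BIS-LLM
```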
Feature availability
| Feature | BIS-LLM | Engine-Builder-LLM | BEI | BEI-Bert | Notes |
|---|---|---|---|---|---|
| Quantization | ✅ | ✅ | ✅ | ❌ | BEI-Bert: FP16/BF16 only |
| KV quantization | ✅ | ✅ | ❌ | ❌ | FP8_KV, FP4_KV supported |
| Speculative lookahead decoding | Gated | ✅ | ❌ | ❌ | n-gram based speculation |
| Self-serviceable | Gated/✅ | ✅ | ✅ | ✅ | Self-service for all engines; some BIS-LLM configurations are gated |
| KV-routing | Gated | ❌ | ❌ | ❌ | BIS-LLM only |
| Disaggregated serving | Gated | ❌ | ❌ | ❌ | BIS-LLM enterprise |
| Tool calling & structured output | ✅ | ✅ | ❌ | ❌ | Function calling support |
| Classification models | ❌ | ❌ | ✅ | ✅ | Sequence classification |
| Embedding models | ❌ | ❌ | ✅ | ✅ | Embedding generation |
| Mixture-of-experts | ✅ | ❌ | ❌ | ❌ | Mixture of Experts models like DeepSeek |
| MTP and Eagle 3 speculation | Gated | ❌ | ❌ | ❌ | Model-based speculation |
| HTTP request cancellation | ✅ | ❌ | ✅ | ✅ | Engine-Builder-LLM supports cancellation only within the first 10 ms of a request |
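For engines that support HTTP request cancellation, closing the connection is enough to stop generation. The sketch below assumes an OpenAI-compatible streaming endpoint; the base URL, API key, and model name are placeholders.

```python
# Client-side cancellation sketch: engines that support HTTP request cancellation
# stop generating when the client closes the connection.
import httpx

payload = {
    "model": "my-dense-llm",      # placeholder deployment name
    "messages": [{"role": "user", "content": "Write a very long story."}],
    "stream": True,
    "max_tokens": 4096,
}

with httpx.Client(timeout=30.0) as client:
    with client.stream(
        "POST",
        "https://example.com/v1/chat/completions",   # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json=payload,
    ) as response:
        for i, line in enumerate(response.iter_lines()):
            print(line)
            if i >= 20:     # stop early for any reason...
                break       # ...leaving the block closes the stream, which
                            # cancels the request on supporting engines
```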
Architecture recommendations
BEI vs BEI-Bert (embeddings)
BEI-Bert optimizes BERT-based architectures (sentence-transformers, jinaai, nomic-ai) with fast cold-start performance and 16-bit precision. Choose BEI-Bert for bidirectional models under 4B parameters where cold-start latency matters. Jina-BERT, Nomic, and ModernBERT architectures all run well on this engine.
BEI handles causal embedding architectures (Llama, Mistral, Qwen) with FP8/FP4 quantization support. Choose BEI when you need maximum throughput or want to run larger embedding models like BAAI/bge, Qwen3-Embedding, or Salesforce/SFR-Embedding with quantization.
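A minimal embedding request, assuming the deployment exposes an OpenAI-compatible /v1/embeddings route; the base URL, API key, and model name below are placeholders.

```python
# Send a batch of texts to an embedding deployment and inspect the result.
from openai import OpenAI

client = OpenAI(
    base_url="https://example.com/v1",   # placeholder deployment URL
    api_key="YOUR_API_KEY",
)

response = client.embeddings.create(
    model="my-embedding-model",          # placeholder model/deployment name
    input=["Which engines support FP8?", "How fast is BEI?"],
)

print(len(response.data))                # one embedding per input text
print(len(response.data[0].embedding))   # embedding dimension
```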
Engine-Builder-LLM vs BIS-LLM (text generation)
Engine-Builder-LLM serves dense (non-MoE) models with lookahead decoding and structured outputs. Choose it for Llama 3.3, Qwen-3, Qwen2.5, Mistral, or Gemma-3 when you need speculative decoding for coding agents or JSON schema validation.
BIS-LLM serves large MoE models with KV-aware routing and advanced tool calling. Choose it for DeepSeek-R1, Qwen3MoE, Kimi-K2, Llama-4, or GLM-4.7 when you need enterprise features like disaggregated serving or H100/B200 optimization.
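A sketch of a structured-output request, assuming the deployment exposes an OpenAI-compatible chat completions API with JSON-schema response formats; the base URL, API key, and model name are placeholders.

```python
# Ask the model for output that conforms to a JSON schema.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "loc": {"type": "integer"},
    },
    "required": ["language", "loc"],
}

completion = client.chat.completions.create(
    model="my-dense-llm",    # placeholder deployment name
    messages=[{"role": "user", "content": "Summarize this repo as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "repo_summary", "schema": schema},
    },
)

print(completion.choices[0].message.content)   # JSON matching the schema
```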
Performance benchmarks
Benchmark results depend on model size, GPU type, and quantization settings. The figures below represent typical performance on H100 GPUs.
Embedding performance (BEI/BEI-Bert)
- Throughput: Up to 1400 client embeddings per second.
- Latency: Sub-millisecond response times.
- Quantization: FP8/FP4 provides a 2x speedup with less than 1% accuracy loss.
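To sanity-check these figures against your own deployment, a rough, hypothetical throughput probe might look like the following; the base URL, API key, and model name are placeholders, and real numbers depend on batch size, model, GPU, and quantization.

```python
# Measure approximate embeddings/sec with concurrent batched requests.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")
BATCH = ["a representative benchmark sentence"] * 32   # 32 texts per request

def one_request(_):
    return client.embeddings.create(model="my-embedding-model", input=BATCH)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:       # 16 concurrent clients
    results = list(pool.map(one_request, range(64)))   # 64 requests total
elapsed = time.perf_counter() - start

total_embeddings = sum(len(r.data) for r in results)
print(f"{total_embeddings / elapsed:.0f} embeddings/sec")
```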
Text generation performance (Engine-Builder-LLM/BIS-LLM)
- Speculative decoding: Faster inference for code and structured content through lookahead decoding.
- Quantization: Memory reduction and speed improvements with FP8/FP4.
- Distributed inference: Scalable deployment with tensor parallelism.
Hardware requirements and optimization
Quantization reduces memory usage and improves inference speed.
| Quantization | Minimum GPU | Recommended GPU | Memory reduction | Notes |
|---|---|---|---|---|
| FP16/BF16 | A100 | H100 | None | Baseline precision |
| FP8 | L4 | H100 | ~50% | Good balance of performance and accuracy |
| FP8_KV | L4 | H100 | ~60% | KV cache quantization for memory efficiency |
| FP4 | B200 | B200 | ~75% | B200-only quantization |
| FP4_KV | B200 | B200 | ~80% | Maximum memory reduction |
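The memory-reduction column follows directly from the bits stored per weight. The back-of-the-envelope sketch below covers weights only; KV cache, activations, and runtime overhead come on top, so real requirements are higher.

```python
# Approximate weight memory per precision for a given parameter count.
BITS_PER_WEIGHT = {"FP16/BF16": 16, "FP8": 8, "FP4": 4}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring KV cache and activations."""
    bytes_per_param = BITS_PER_WEIGHT[precision] / 8
    return params_billion * bytes_per_param

for precision in BITS_PER_WEIGHT:
    print(f"70B model at {precision}: ~{weight_memory_gb(70, precision):.0f} GB of weights")
# FP16/BF16 ~140 GB, FP8 ~70 GB (~50% less), FP4 ~35 GB (~75% less)
```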
Specialized deployments
Some models require specialized engines that are not self-serviceable:
- Whisper: Audio transcription and speech recognition.
- Orpheus: Audio generation.
- Flux: Image generation.
Next steps
- BEI documentation: Embeddings and classification.
- Engine-Builder-LLM documentation: Dense text generation.
- BIS-LLM documentation: MoE and advanced features.
- BEI deployment guide: Complete embedding model setup.
- TensorRT-LLM examples: Dense LLM deployment.
- DeepSeek examples: Large MoE deployment.