- BEI: Embedding, reranking, and classification models on causal architectures with FP8 and FP4 quantization.
- BEI-Bert: Bidirectional BEI variant tuned for BERT-family encoders and cold-start-sensitive models under 4B parameters.
- Engine-Builder-LLM: Dense text generation for Llama, Qwen, Mistral, and Gemma with lookahead decoding and multi-LoRA support.
- BIS-LLM: MoE and Enterprise serving with KV-aware routing, disaggregated prefill/decode, and Eagle/MTP speculation.
Choose an engine
Pick the entry below that matches what you're deploying. Cost, quality, and latency targets drive later choices (GPU, quantization, autoscaling) inside that engine.
- Embedding, reranking, classification, or NER models: use BEI for decoder embedders (Qwen3-Embedding, BAAI/bge, LlamaForSequenceClassification) or BEI-Bert for BERT-family encoders (BERT, ModernBERT, EuroBERT, XLM-RoBERTa). NER runs on BEI-Bert via /predict_tokens.
- Dense text-generation LLMs (Llama 3 or 4, Qwen 3 or 3.5, Mistral, Gemma, Phi, GPT-OSS-20B): use Engine-Builder-LLM, with lookahead decoding and multi-LoRA available.
- MoE models (GLM 5.x, Kimi K2.5 or K2.6, DeepSeek V3, R1, or V4, MiniMax 2.5, Qwen3 MoE, GPT-OSS-120B) or workloads that need KV-cache-aware routing or disaggregated prefill/decode: use BIS-LLM. BIS-LLM is currently a co-engineering pilot.
- Speech, image, video, or custom Python models: ship a custom Truss. Browse model examples for Whisper, Orpheus, Flux, and other pre-built deployments, or see build your first model for custom inference logic.
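The selection rules above can be sketched as a small lookup function. This is only an illustration of the decision logic: `Engine`, `choose_engine`, and all of its parameters are hypothetical names invented for this example and are not part of Truss or any engine API.

```python
from enum import Enum


class Engine(str, Enum):
    """Engine names as used on this page (hypothetical enum, for illustration)."""
    BEI = "BEI"
    BEI_BERT = "BEI-Bert"
    ENGINE_BUILDER_LLM = "Engine-Builder-LLM"
    BIS_LLM = "BIS-LLM"
    CUSTOM_TRUSS = "custom Truss"


def choose_engine(
    task: str,
    *,
    bert_family: bool = False,
    mixture_of_experts: bool = False,
    needs_kv_routing: bool = False,
    needs_disaggregated_serving: bool = False,
) -> Engine:
    """Map a workload description to an engine, following the list above."""
    task = task.lower()
    if task in {"embedding", "reranking", "classification", "ner"}:
        # NER is listed under BEI-Bert only; other tasks split on architecture.
        if bert_family or task == "ner":
            return Engine.BEI_BERT
        return Engine.BEI
    if task == "text-generation":
        # MoE models and KV-routing / disaggregated workloads go to BIS-LLM.
        if mixture_of_experts or needs_kv_routing or needs_disaggregated_serving:
            return Engine.BIS_LLM
        return Engine.ENGINE_BUILDER_LLM
    # Speech, image, video, and custom Python models ship as a custom Truss.
    return Engine.CUSTOM_TRUSS


if __name__ == "__main__":
    print(choose_engine("embedding", bert_family=True).value)
    print(choose_engine("text-generation", mixture_of_experts=True).value)
    print(choose_engine("speech").value)
```

The function only encodes the first-level choice. GPU, quantization, and autoscaling settings are decided afterwards, inside the chosen engine.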
Performance and operations
- Quantization guide: FP8 and FP4 trade-offs, GPU support, and per-engine options.
- Autoscaling engines: Token-based and request-based scaling for engine deployments.
- Cloud storage deployment: Deploy engines from S3 or GCS instead of Hugging Face.
- Specialized model examples: Pre-built Truss examples for Whisper, Orpheus, Flux, and other dedicated deployments.
Compare engines
| Feature | BIS-LLM | Engine-Builder-LLM | BEI | BEI-Bert | Notes |
|---|---|---|---|---|---|
| Quantization | ✅ | ✅ | ✅ | ❌ | BEI-Bert: FP16/BF16 only. |
| KV quantization | ✅ | ✅ | ⚠️ | ⚠️ | FP8_KV, FP4_KV supported. |
| Lookahead decoding | ❌ | ✅ | ❌ | ❌ | Engine-Builder-LLM (v1) only; BIS-LLM uses MTP/Eagle/N-gram speculation instead. |
| Self-serviceable | π | ✅ | ✅ | ✅ | BIS-LLM requires Enterprise; other engines are self-serve. |
| KV-routing | π | ❌ | ❌ | ❌ | BIS-LLM only. |
| Disaggregated serving | π | ❌ | ❌ | ❌ | BIS-LLM Enterprise. |
| Tool calling & structured output | ✅ | ✅ | ❌ | ❌ | Function calling support. |
| Classification models | ❌ | ❌ | ✅ | ✅ | Sequence classification. |
| Embedding models | ❌ | ❌ | ✅ | ✅ | Embedding generation. |
| Mixture-of-experts | ✅ | ⚠️ (Qwen3MoE only) | ❌ | ❌ | MoE models like DeepSeek-R1. |
| MTP / Eagle / N-gram speculation | π | ❌ | ❌ | ❌ | v2 speculative decoding via speculative_config. |
| HTTP request cancellation | β | ⚠️ | β | β | Engine-Builder-LLM: within the first 10ms only. |
| Multimodal inputs | π | β | ⚠️ | β | Selected architectures only. |