- Latency: How quickly does each user get output from the model?
- Throughput: How many requests can the deployment handle at once?
- Cost: How much does a standardized unit of work cost?
- Quality: Does your model consistently deliver high-quality output after optimization?
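These metrics trade off against each other, and throughput drives cost directly. A back-of-envelope calculation (the GPU price and token rate below are hypothetical, not Baseten pricing):

```python
# Hypothetical numbers: a GPU that costs $2.00/hour and sustains
# 10,000 output tokens/second across concurrent requests.
gpu_cost_per_hour = 2.00
tokens_per_second = 10_000

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)

print(f"{tokens_per_hour:,} tokens/hour")            # 36,000,000 tokens/hour
print(f"${cost_per_million_tokens:.4f} per 1M tokens")  # $0.0556 per 1M tokens
```

Doubling sustained throughput at the same hourly price halves the cost per token, which is why the engine optimizations below matter for cost as much as for speed.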
Performance engines
Baseten's performance-optimized engines deliver the best possible inference speed and efficiency:
Engine-Builder-LLM - Dense Models
- Best for: Llama, Mistral, Qwen, and other causal language models
- Features: TensorRT-LLM optimization, lookahead decoding, quantization
- Performance: Lowest latency and highest throughput for dense models
BIS-LLM - MoE Models
- Best for: DeepSeek, Mixtral, and other mixture-of-experts models
- Features: V2 inference stack, expert routing, structured outputs
- Performance: Optimized for large-scale MoE inference
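Structured outputs constrain generation so the model can only emit tokens that keep the response valid against a JSON schema. A sketch of what such a request body looks like for an OpenAI-compatible chat endpoint (the model name is a placeholder for whatever you deploy):

```python
import json

payload = {
    "model": "deepseek-ai/DeepSeek-V3",  # hypothetical deployed MoE model
    "messages": [
        {"role": "user",
         "content": "Extract the city and country from: 'I live in Oslo, Norway.'"}
    ],
    # The engine masks invalid tokens at each step, so the completion is
    # guaranteed to parse against this schema.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

print(json.dumps(payload, indent=2))
```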
BEI - Embedding Models
- Best for: Sentence transformers, rerankers, classification models
- Features: OpenAI-compatible, high-performance embeddings
- Performance: Fastest embedding inference with optimized batching
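Because BEI is OpenAI-compatible, clients send the standard `/v1/embeddings` request shape. A sketch of the request body and the typical downstream use of the returned vectors (the model name is a placeholder):

```python
import json

# Request body for an OpenAI-compatible /v1/embeddings endpoint.
payload = {
    "model": "BAAI/bge-base-en-v1.5",  # hypothetical deployed embedding model
    "input": [
        "Baseten serves embedding models with optimized batching.",
        "Rerankers score query-document pairs.",
    ],
}
print(json.dumps(payload, indent=2))

# The response carries one vector per input string; downstream you
# typically compare them with cosine similarity:
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

print(cosine([1.0, 0.0], [0.6, 0.8]))  # 0.6
```

Batching many strings into one `input` list lets the server exploit its optimized batching instead of handling each string as a separate request.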
Performance concepts
Detailed performance optimization guides are now organized in the Performance Concepts section:
- Quantization Guide - FP8/FP4 trade-offs and hardware requirements
- Structured Outputs - JSON schema validation and controlled generation
- Function Calling - Tool use and function selection
- Performance Client - High-throughput client library
- Deployment Guide - Training checkpoints and cloud storage
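The performance client above is a Rust library, but its core idea, keeping the deployment saturated while bounding in-flight requests, can be sketched with plain asyncio (a generic pattern, not the client's actual API):

```python
import asyncio

async def send(request_id: int) -> str:
    """Stand-in for an HTTP call to the model endpoint."""
    await asyncio.sleep(0.01)
    return f"response-{request_id}"

async def run_all(n_requests: int, max_concurrency: int) -> list[str]:
    # The semaphore caps concurrent requests so the server sees steady,
    # bounded load instead of a thundering herd.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int) -> str:
        async with sem:
            return await send(i)

    return await asyncio.gather(*(bounded(i) for i in range(n_requests)))

results = asyncio.run(run_all(32, max_concurrency=8))
print(len(results))  # 32
```

A dedicated client moves this loop out of Python entirely, which is where the extra throughput comes from.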
⚡ Quick performance wins
Quantization
Reduce memory usage and improve speed with post-training quantization.
Lookahead Decoding
Accelerate inference for predictable content (code, JSON).
Performance Client
Maximize client-side throughput with the Rust-based performance client.
🔧 Where to start
- Choose your engine: Engines overview
- Configure your model: Engine-specific configuration guides
- Optimize performance: Performance concepts
- Deploy and monitor: Use performance client for maximum throughput
💡 Tip: Start with the default engine configuration, then apply quantization and other optimizations based on your specific performance requirements.
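To see why quantization is usually the first optimization to try, a rough memory calculation (the parameter count and byte widths are illustrative):

```python
# Weight memory for a 70B-parameter model at different precisions.
params = 70e9

bytes_per_param = {"fp16": 2, "fp8": 1, "fp4": 0.5}
for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB of weights")
# fp16: 140 GB, fp8: 70 GB, fp4: 35 GB -- halving precision halves
# weight memory, freeing the GPU for more KV cache or a larger model.
```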