- Latency: How quickly does each user get output from the model?
- Throughput: How many requests can the deployment handle at once?
- Cost: How much does a standardized unit of work cost?
- Quality: Does your model consistently deliver high-quality output after optimization?
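As a rough illustration of how the first three questions turn into numbers, the sketch below derives mean latency, aggregate throughput, and cost per million output tokens from a few request records. The record fields, the sample values, and the hourly GPU price are hypothetical placeholders, not Baseten measurements or pricing.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    start_s: float      # time the request was sent (seconds)
    end_s: float        # time the last token arrived (seconds)
    output_tokens: int  # tokens generated for this request


def summarize(records, gpu_price_per_hour):
    """Return mean latency, tokens per second, and cost per 1M output tokens."""
    latencies = [r.end_s - r.start_s for r in records]
    # Throughput is measured over the whole window, since requests overlap.
    wall_clock_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    tokens_per_second = total_tokens / wall_clock_s
    cost_per_token = gpu_price_per_hour / 3600 / tokens_per_second
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_second": tokens_per_second,
        "cost_per_million_tokens": cost_per_token * 1_000_000,
    }


# Hypothetical sample: three overlapping requests on one GPU.
sample = [
    RequestRecord(0.0, 2.0, 200),
    RequestRecord(0.5, 2.5, 300),
    RequestRecord(1.0, 4.0, 500),
]
print(summarize(sample, gpu_price_per_hour=3.6))
```

Quality is left out on purpose: it needs a task-specific evaluation rather than a timing formula.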
Performance engines
Baseten provides three managed inference engines. Pick the one that matches your model architecture:
Engine-Builder-LLM: dense models
- Best for: Llama, Mistral, Qwen, and other causal language models.
- Features: TensorRT-LLM optimization, lookahead decoding, quantization.
- Performance: Tuned for low-latency, high-throughput dense LLM inference.
BIS-LLM: MoE models
- Best for: DeepSeek, Mixtral, and other mixture-of-experts models.
- Features: V2 inference stack, expert routing, structured outputs.
- Performance: Tuned for large-scale MoE inference.
BEI: embedding models
- Best for: Sentence transformers, rerankers, classification models.
- Features: OpenAI-compatible API, optimized batching.
- Performance: Tuned for high-throughput embedding inference.
Performance concepts
Detailed optimization guides live in the performance concepts section:
- Quantization guide: FP8 and FP4 trade-offs and hardware requirements.
- Structured outputs: JSON schema validation and controlled generation.
- Function calling: tool use and function selection.
- Performance client: high-throughput client library.
- Deploy from cloud storage: GCS, S3, and Azure with Engine-Builder-LLM.
- Deploy with inference engines: Baseten Training checkpoints with TRT-LLM.
Quick performance wins
Quantization
Reduce weight memory and improve throughput with post-training quantization. Set the quantization type in config.yaml; available options include fp8, fp8_kv, fp4, fp4_kv, and fp4_mlp_only.
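To see why quantization reduces weight memory, a back-of-the-envelope estimate helps. The sketch below multiplies parameter count by bytes per weight (2 for FP16, 1 for FP8, 0.5 for FP4). It assumes every weight is stored at the chosen precision and ignores scaling factors, the KV cache, and activations, so treat the results as rough lower bounds. The helper name and the 8B-parameter example are illustrative only.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (10**9 bytes) at the given precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


# Example: a hypothetical 8-billion-parameter dense model.
for precision in ("fp16", "fp8", "fp4"):
    print(f"{precision}: {weight_memory_gb(8e9, precision):.1f} GB")
```

For the 8B example this prints 16.0 GB, 8.0 GB, and 4.0 GB, which is why moving from FP16 to FP8 or FP4 can free room for larger batches on the same GPU.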
Lookahead decoding
Accelerate inference for predictable content like code or JSON by enabling lookahead decoding in config.yaml.
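The intuition for why predictable content benefits can be shown with a toy draft-and-verify loop. This is a simplified sketch, not the lookahead decoding algorithm that the engine runs: it drafts tokens by matching the most recent n-gram against earlier output, then checks each guess against the model. Real engines verify a whole draft in a single forward pass, which is modeled here as one "step". The toy model and all function names are hypothetical.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: one model step per generated token."""
    tokens, steps = list(prompt), 0
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
        steps += 1
    return tokens[len(prompt):], steps


def draft_from_ngrams(tokens, ngram_size, max_draft):
    """Guess a continuation from the latest earlier match of the last n-gram."""
    if len(tokens) <= ngram_size:
        return []
    key = tokens[-ngram_size:]
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []


def draft_and_verify_decode(next_token, prompt, max_new_tokens,
                            ngram_size=2, max_draft=4):
    """Same output as greedy decoding, but several tokens may land per step."""
    tokens, steps = list(prompt), 0
    target = len(prompt) + max_new_tokens
    while len(tokens) < target:
        draft = draft_from_ngrams(tokens, ngram_size, max_draft)
        steps += 1  # one verification pass, modeled as a single model step
        for guess in draft:
            actual = next_token(tokens)
            tokens.append(actual)  # always keep the model's own token
            if actual != guess or len(tokens) >= target:
                break
        else:
            # Empty draft or every guess accepted: the pass yields one more token.
            tokens.append(next_token(tokens))
    return tokens[len(prompt):], steps


def repetitive_model(tokens):
    """Toy stand-in for an LLM that emits a repeating JSON-like pattern."""
    pattern = ["{", '"id"', ":", "0", "}", ","]
    return pattern[len(tokens) % len(pattern)]


prompt = ["{", '"id"', ":", "0", "}", ","]
baseline, baseline_steps = greedy_decode(repetitive_model, prompt, 12)
fast, fast_steps = draft_and_verify_decode(repetitive_model, prompt, 12)
print(baseline == fast, baseline_steps, fast_steps)
```

Because the model's own token is always the one appended, the output matches greedy decoding exactly; the saving comes only from accepting several correct guesses per step, which happens more often when the text repeats itself.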
Performance client
Use the Rust-based client for high-throughput batched requests.
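The general pattern behind high-throughput batched requests can be sketched with the Python standard library. This is not the performance client's API: the sketch only shows splitting inputs into batches and sending them concurrently while keeping results in input order. The `fake_embed_batch` function is a hypothetical stand-in for one HTTP request to an embedding endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def fake_embed_batch(texts):
    """Hypothetical stand-in for one request; returns one vector per text."""
    return [[float(len(text))] for text in texts]


def embed_all(texts, batch_size=4, max_workers=8, send=fake_embed_batch):
    """Send batches concurrently and return vectors in input order."""
    batches = list(chunked(texts, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        results = list(pool.map(send, batches))
    return [vector for batch in results for vector in batch]


print(embed_all(["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2))
```

A dedicated client applies the same idea with lower per-request overhead than Python threads.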
Where to start
- Choose your engine: Engine selection
- Configure your model: Engine-specific configuration guides
- Optimize performance: Performance concepts
- Deploy and monitor: Use the performance client for maximum throughput