Optimizing model performance means tuning every layer of your model serving infrastructure to balance four goals:
  1. Latency: How quickly does each user get output from the model?
  2. Throughput: How many requests can the deployment handle at once?
  3. Cost: How much does a standardized unit of work cost?
  4. Quality: Does your model consistently deliver high-quality output after optimization?

Performance engines

Baseten's performance-optimized engines deliver the best possible inference speed and efficiency:

Engine-Builder-LLM - Dense Models

  • Best for: Llama, Mistral, Qwen, and other causal language models
  • Features: TensorRT-LLM optimization, lookahead decoding, quantization
  • Performance: Lowest latency and highest throughput for dense models
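
For example, a dense model can be pointed at the engine builder with a few lines of config. A minimal sketch, assuming a Hugging Face-hosted Llama checkpoint (the repo name is illustrative):

trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      source: HF  # pull weights from Hugging Face
      repo: meta-llama/Llama-3.1-8B-Instruct  # illustrative checkpoint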

BIS-LLM - MoE Models

  • Best for: DeepSeek, Mixtral, and other mixture-of-experts models
  • Features: V2 inference stack, expert routing, structured outputs
  • Performance: Optimized for large-scale MoE inference
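
As a sketch of the structured-outputs feature, the request below asks a chat endpoint for JSON that conforms to a schema. OpenAI-compatible access is an assumption here, and the base_url and model name are placeholders; substitute your deployment's actual endpoint:

# Structured-output request against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek",  # placeholder model name
    messages=[{"role": "user", "content": "Return the capital of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
)
print(response.choices[0].message.content)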

BEI - Embedding Models

  • Best for: Sentence transformers, rerankers, classification models
  • Features: OpenAI-compatible, high-performance embeddings
  • Performance: Fastest embedding inference with optimized batching
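
Because BEI is OpenAI-compatible, the standard OpenAI SDK works against it. A minimal sketch; the base_url and model name are placeholders:

# Embedding request via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

result = client.embeddings.create(
    model="my-embedding-model",  # placeholder
    input=["What is inference optimization?", "How do rerankers work?"],
)
print(len(result.data[0].embedding))  # embedding dimensionality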

Performance concepts

Detailed performance optimization guides are now organized in the Performance Concepts section:

⚡ Quick performance wins

Quantization

Reduce memory usage and improve speed with post-training quantization:
trt_llm:
  build:
    quantization_type: fp8  # roughly halves weight memory vs. FP16

Lookahead Decoding

Accelerate inference for predictable content (code, JSON):
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      windows_size: 5

Performance Client

Maximize client-side throughput with the Rust-based performance client:
pip install baseten-performance-client
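
A minimal usage sketch follows. The class, method, and argument names are assumptions modeled on the package's published examples, and the base_url and model name are placeholders; check the package documentation for the exact API:

# Batched, concurrent embedding requests through the Rust-based client.
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync",  # placeholder
    api_key="YOUR_API_KEY",
)

texts = ["example sentence"] * 1024
response = client.embed(
    input=texts,
    model="my-embedding-model",    # placeholder
    batch_size=16,                 # texts per HTTP request (assumed parameter)
    max_concurrent_requests=64,    # client-side fan-out (assumed parameter)
)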

🔧 Where to start

  1. Choose your engine: Engines overview
  2. Configure your model: Engine-specific configuration guides
  3. Optimize performance: Performance concepts
  4. Deploy and monitor: Use performance client for maximum throughput

💡 Tip: Start with the default engine configuration, then apply quantization and other optimizations based on your specific performance requirements.