Optimizing model performance means tuning every layer of your model serving infrastructure to balance four goals:
  1. Latency: How quickly does each user get output from the model?
  2. Throughput: How many requests can the deployment handle at once?
  3. Cost: How much does a standardized unit of work cost?
  4. Quality: Does your model consistently deliver high-quality output after optimization?

Performance engines

Baseten's performance-optimized engines deliver the best possible inference speed and efficiency:

Engine-Builder-LLM - Dense Models

  • Best for: Llama, Mistral, Qwen, and other causal language models
  • Features: TensorRT-LLM optimization, lookahead decoding, quantization
  • Performance: Lowest latency and highest throughput for dense models
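
For example, a dense model can be pointed at the engine builder with a few lines of config. A minimal sketch, assuming a Hugging Face-hosted Llama checkpoint (the repo name is illustrative):

trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      source: HF  # pull weights from Hugging Face
      repo: meta-llama/Llama-3.1-8B-Instruct  # illustrative checkpoint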

BIS-LLM - MoE Models

  • Best for: DeepSeek, Mixtral, and other mixture-of-experts models
  • Features: V2 inference stack, expert routing, structured outputs
  • Performance: Optimized for large-scale MoE inference
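
As a sketch of the structured-outputs feature, the request below asks a chat endpoint for JSON that conforms to a schema. OpenAI-compatible access is an assumption here, and the base_url and model name are placeholders; substitute your deployment's actual endpoint:

# Structured-output request against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek",  # placeholder model name
    messages=[{"role": "user", "content": "Return the capital of France as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
)
print(response.choices[0].message.content)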

BEI - Embedding Models

  • Best for: Sentence transformers, rerankers, classification models
  • Features: OpenAI-compatible, high-performance embeddings
  • Performance: Fastest embedding inference with optimized batching
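
Because BEI is OpenAI-compatible, the standard OpenAI SDK works against it. A minimal sketch; the base_url and model name are placeholders:

# Embedding request via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

result = client.embeddings.create(
    model="my-embedding-model",  # placeholder
    input=["What is inference optimization?", "How do rerankers work?"],
)
print(len(result.data[0].embedding))  # embedding dimensionality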

Performance concepts

Detailed performance optimization guides are now organized in the Performance Concepts section:

⚡ Quick performance wins

Quantization

Reduce memory usage and improve speed with post-training quantization:
trt_llm:
  build:
    quantization_type: fp8  # roughly halves weight memory vs. FP16

Lookahead Decoding

Accelerate inference for predictable content (code, JSON):
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      windows_size: 5

Performance Client

Maximize client-side throughput with the Rust-based performance client:
pip install baseten-performance-client
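
A minimal usage sketch follows. The class, method, and argument names are assumptions modeled on the package's published examples, and the base_url and model name are placeholders; check the package documentation for the exact API:

# Batched, concurrent embedding requests through the Rust-based client.
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync",  # placeholder
    api_key="YOUR_API_KEY",
)

texts = ["example sentence"] * 1024
response = client.embed(
    input=texts,
    model="my-embedding-model",    # placeholder
    batch_size=16,                 # texts per HTTP request (assumed parameter)
    max_concurrent_requests=64,    # client-side fan-out (assumed parameter)
)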

🔧 Where to start

  1. Choose your engine: Engines overview
  2. Configure your model: Engine-specific configuration guides
  3. Optimize performance: Performance concepts
  4. Deploy and monitor: Use performance client for maximum throughput

💡 Tip: Start with the default engine configuration, then apply quantization and other optimizations based on your specific performance requirements.