Model performance means optimizing every layer of your model serving infrastructure to balance four goals:
  1. Latency: How quickly does each user get output from the model?
  2. Throughput: How many requests can the deployment process per second?
  3. Cost: How much does a standardized unit of work cost?
  4. Quality: Does your model consistently deliver high-quality output after optimization?
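As a rough illustration of how the first three goals can be turned into numbers, the sketch below summarizes a load test. The `RequestRecord` type, the `summarize` helper, the GPU price, and the sample measurements are all hypothetical and exist only for this example; quality has to be checked separately with an evaluation set.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float    # end-to-end time for one request, in seconds
    output_tokens: int  # tokens generated for that request


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, wall_clock_s, gpu_cost_per_hour):
    """Summarize latency, throughput, and cost for one load-test run."""
    total_tokens = sum(r.output_tokens for r in records)
    latencies = [r.latency_s for r in records]
    run_cost = gpu_cost_per_hour * wall_clock_s / 3600
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p90_latency_s": percentile(latencies, 90),
        "requests_per_s": len(records) / wall_clock_s,
        "tokens_per_s": total_tokens / wall_clock_s,
        "cost_per_million_tokens": run_cost / total_tokens * 1_000_000,
    }


# Made-up measurements, for illustration only.
records = [
    RequestRecord(0.8, 200),
    RequestRecord(1.0, 250),
    RequestRecord(1.2, 300),
    RequestRecord(2.0, 250),
]
summary = summarize(records, wall_clock_s=2.0, gpu_cost_per_hour=3.6)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing these numbers before and after each optimization makes the trade-offs visible: a change that raises tokens per second but also raises p90 latency may or may not be acceptable for your workload.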

Performance engines

Baseten provides three managed inference engines. Pick the one that matches your model architecture:

Engine-Builder-LLM: dense models

  • Best for: Llama, Mistral, Qwen, and other causal language models.
  • Features: TensorRT-LLM optimization, lookahead decoding, quantization.
  • Performance: Tuned for low-latency, high-throughput dense LLM inference.

BIS-LLM: MoE models

  • Best for: DeepSeek, Mixtral, and other mixture-of-experts models.
  • Features: V2 inference stack, expert routing, structured outputs.
  • Performance: Tuned for large-scale MoE inference.

BEI: embedding models

  • Best for: Sentence transformers, rerankers, classification models.
  • Features: OpenAI-compatible API, optimized batching.
  • Performance: Tuned for high-throughput embedding inference.

Performance concepts

Detailed optimization guides live in the performance concepts section.

Quick performance wins

Quantization

Reduce weight memory and improve throughput with post-training quantization:
trt_llm:
  build:
    quantization_type: fp8  # FP8 weights, 16-bit KV cache
See the quantization guide for all supported modes (fp8, fp8_kv, fp4, fp4_kv, fp4_mlp_only).
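To see why lower precision reduces weight memory, a back-of-the-envelope estimate helps. The sketch below multiplies parameter count by bytes per weight. It is an approximation under stated assumptions: it ignores quantization scale factors, the KV cache, and activations, and the 8-billion-parameter model is an arbitrary example. Modes that quantize only part of the model would save less than shown here.

```python
# Approximate storage per weight, in bytes, for each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(num_params, precision):
    """Estimate weight memory in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


num_params = 8e9  # hypothetical 8B-parameter dense model
for precision in BYTES_PER_WEIGHT:
    print(f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB")
```

For this example the estimate is about 16 GB at 16-bit, 8 GB at FP8, and 4 GB at FP4. Memory freed from weights can go to a larger KV cache, which is one reason quantization tends to improve throughput as well.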

Lookahead decoding

Accelerate inference for predictable content like code or JSON:
trt_llm:
  build:
    speculator:
      speculative_decoding_mode: LOOKAHEAD_DECODING
      lookahead_windows_size: 3
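The toy sketch below illustrates the guess-and-verify idea that makes this kind of decoding fast on repetitive output. It is not the TensorRT-LLM implementation and makes several simplifying assumptions: the "model" is a hypothetical function that cycles through a JSON-like pattern, the n-gram pool is filled from tokens already generated (real lookahead decoding also produces n-grams with a parallel lookahead branch), and verification is a Python loop rather than a single batched forward pass.

```python
PATTERN = ["{", '"k"', ":", "1", "}", ","]


def next_token(context):
    """Hypothetical stand-in for a model's greedy next-token choice."""
    return PATTERN[len(context) % len(PATTERN)]


def greedy(prompt, max_new_tokens):
    """Baseline: one token per decode step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def verify(context, candidate):
    """Keep the prefix of `candidate` that greedy decoding would also emit."""
    accepted = []
    for token in candidate:
        if next_token(context + accepted) != token:
            break
        accepted.append(token)
    return accepted


def decode_with_guesses(prompt, max_new_tokens, ngram_size=3):
    tokens = list(prompt)
    pool = {}  # token -> the n-gram that most recently followed it
    steps = 0  # decode steps; each stands in for one forward pass
    while len(tokens) - len(prompt) < max_new_tokens:
        steps += 1
        candidate = pool.get(tokens[-1], [])
        accepted = verify(tokens, candidate)
        # Every step also yields one token from the model itself, so the
        # output matches greedy decoding even when the guess is rejected.
        accepted.append(next_token(tokens + accepted))
        budget = max_new_tokens - (len(tokens) - len(prompt))
        tokens.extend(accepted[:budget])
        # Refresh the pool with n-grams seen in the text so far.
        for i in range(len(tokens) - ngram_size):
            pool[tokens[i]] = tokens[i + 1:i + 1 + ngram_size]
    return tokens[len(prompt):], steps


output, steps = decode_with_guesses(["{"], max_new_tokens=12)
print("same as greedy:", output == greedy(["{"], 12))
print("decode steps:", steps, "for", len(output), "tokens")
```

Once the pattern starts repeating, each step accepts several tokens at a time, so fewer steps are needed than tokens produced. On text with little repetition the guesses are rejected and the loop falls back to one token per step.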

Performance client

Use the Rust-based client for high-throughput batched requests:
uv pip install baseten-performance-client
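This page does not show the client's API, so the sketch below uses only the Python standard library to illustrate the general pattern a batching client automates: split a large input list into batches and send the batches concurrently. `send_batch` is a hypothetical stand-in for one request to an endpoint, and the parameter names `batch_size` and `max_concurrent_requests` are chosen for this example, not taken from the client.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, batch_size):
    """Split `items` into consecutive lists of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def send_batch(batch):
    # Hypothetical stand-in for one HTTP request to an embedding endpoint.
    # Here it just returns the length of each text.
    return [len(text) for text in batch]


def run_batched(items, batch_size=4, max_concurrent_requests=8):
    """Send batches concurrently and return results in input order."""
    batches = chunked(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        # map() preserves input order, so results line up with `items`.
        results = pool.map(send_batch, batches)
        return [value for batch_result in results for value in batch_result]


texts = [f"document {i}" for i in range(10)]
outputs = run_batched(texts)
print(len(outputs), "results for", len(texts), "inputs")
```

Larger batches reduce per-request overhead, while more concurrent requests keep the deployment busy; the right balance depends on the model and the deployment's capacity.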

Where to start

  1. Choose your engine: Engine selection
  2. Configure your model: Engine-specific configuration guides
  3. Optimize performance: Performance concepts
  4. Deploy and monitor: Use the performance client for maximum throughput

Start with the default engine configuration, then apply quantization and other optimizations based on your specific performance requirements.