Optimizing model performance means tuning every layer of your model serving infrastructure to balance four goals (see the measurement sketch after the list):

  1. Latency: on a per-request basis, how quickly does each user get output from the model?
  2. Throughput: how many requests or users can the deployment handle at once?
  3. Cost: how much does a standardized unit of work (e.g. 1M tokens from an LLM) cost?
  4. Quality: does your model consistently deliver high-quality output after optimization?
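
To make the first three goals concrete, here is a minimal sketch of how they are commonly measured from load-test results. All numbers (request timings, token counts, GPU price) are assumptions for illustration; quality is evaluated separately with evals rather than a formula.

```python
from statistics import mean

# Hypothetical per-request measurements from a short load test.
requests = [
    {"latency_s": 0.82, "output_tokens": 410},
    {"latency_s": 1.10, "output_tokens": 520},
    {"latency_s": 0.95, "output_tokens": 460},
]
test_duration_s = 2.0      # wall-clock duration of the load test (assumed)
gpu_cost_per_hour = 2.50   # assumed hourly price for the serving GPU

total_tokens = sum(r["output_tokens"] for r in requests)

# 1. Latency: per request, how long until a user has their output.
avg_latency_s = mean(r["latency_s"] for r in requests)

# 2. Throughput: aggregate tokens generated per second across concurrent requests.
throughput_tps = total_tokens / test_duration_s

# 3. Cost: normalize GPU spend to a standard unit of work, e.g. $ per 1M output tokens.
tokens_per_hour = throughput_tps * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(f"avg latency: {avg_latency_s:.2f}s, throughput: {throughput_tps:.0f} tok/s, "
      f"cost: ${cost_per_million_tokens:.2f} per 1M tokens")
```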

Model performance tooling

TensorRT-LLM Engine Builder

Baseten’s TensorRT-LLM engine builder simplifies and automates the process of using TensorRT-LLM for development and production.

Full-stack model performance

Model and GPU selection

Two of the highest-impact choices for model performance come before any optimization work: picking the best model size and implementation, and picking the right GPU to run it on.
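
As a starting point for GPU selection, a rough VRAM estimate tells you which model size and precision combinations fit on a single GPU. This is a sketch under simplified assumptions; the 1.2x overhead factor and GPU specs are illustrative, and real KV cache requirements depend on batch size and sequence length.

```python
def estimate_serving_memory_gb(params_billion: float, bytes_per_param: float,
                               kv_cache_overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory plus headroom for KV cache and activations."""
    weight_gb = params_billion * bytes_per_param  # e.g. 1B params * 2 bytes (fp16) ~= 2 GB
    return weight_gb * kv_cache_overhead

# Assumed VRAM per GPU, in GB.
gpus = {"L4": 24, "A100 80GB": 80, "H100 80GB": 80}
configs = [("8B @ fp16", 8, 2), ("8B @ fp8", 8, 1), ("70B @ fp16", 70, 2), ("70B @ fp8", 70, 1)]

for name, params_b, bytes_pp in configs:
    need_gb = estimate_serving_memory_gb(params_b, bytes_pp)
    fits_on = [gpu for gpu, vram in gpus.items() if need_gb <= vram]
    print(f"{name}: ~{need_gb:.0f} GB needed, fits on: {fits_on or 'multi-GPU required'}")
```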

GPU-level optimizations

Our first goal is to get the best possible performance out of a single GPU or GPU cluster.
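A useful mental model for single-GPU performance: at low batch sizes, LLM decoding is typically memory-bandwidth bound because every generated token reads all model weights from VRAM. The sketch below estimates that upper bound; the bandwidth figures are approximate public specs and the model sizes are assumptions, but it shows why quantization and faster memory raise per-request speed.

```python
def decode_tps_upper_bound(model_size_gb: float, memory_bandwidth_gbps: float) -> float:
    # Bandwidth-bound limit: tokens/s <= bandwidth / bytes read per token (~model size).
    return memory_bandwidth_gbps / model_size_gb

# Approximate memory bandwidth in GB/s (assumed): A100 80GB ~2,000, H100 SXM ~3,350.
for gpu, bandwidth in [("A100 80GB", 2000), ("H100 SXM", 3350)]:
    for model, size_gb in [("8B @ fp16", 16), ("8B @ fp8", 8)]:
        limit = decode_tps_upper_bound(size_gb, bandwidth)
        print(f"{gpu}, {model}: <= ~{limit:.0f} tokens/s per request")
```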

Infrastructure-level optimizations

Once we've squeezed as much throughput (tokens per second) as possible out of each GPU, we scale that performance horizontally with infrastructure-level optimizations.
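Capacity planning for horizontal scaling reduces to simple arithmetic once you know per-replica throughput. This is a sketch with hypothetical numbers; the per-replica capacity, peak traffic, and utilization target are assumptions you would replace with your own measurements and autoscaling policy.

```python
import math

per_replica_tps = 2400     # tokens/s one optimized replica sustains (assumed, from benchmarks)
target_utilization = 0.7   # headroom so latency doesn't spike at peak load
peak_traffic_tps = 18000   # expected peak aggregate tokens/s across all users (assumed)

replicas_needed = math.ceil(peak_traffic_tps / (per_replica_tps * target_utilization))
print(f"replicas needed at peak: {replicas_needed}")  # -> 11 with these assumptions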

Application-level optimizations

There are also application-level steps that you can take to make sure you’re getting the most value from your optimized endpoint.
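
One common application-level step is streaming output so users see tokens as they are generated instead of waiting for the full response. The sketch below assumes a Baseten-style HTTP endpoint that supports streamed output; the URL, header format, and request fields are placeholders for whatever your deployment actually exposes.

```python
import requests

# Placeholder endpoint and API key; consult your deployment for the real values.
resp = requests.post(
    "https://model-xxxxxx.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "Summarize the release notes.", "stream": True, "max_tokens": 256},
    stream=True,
)

# Print tokens as they arrive, improving perceived latency for the end user.
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
```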