Model performance overview
Improve your latency and throughput
Model performance means optimizing every layer of your model serving infrastructure to balance four goals:
- Latency: on a per-request basis, how quickly does each user get output from the model?
- Throughput: how many requests or users can the deployment handle at once?
- Cost: how much does a standardized unit of work (e.g. 1M tokens from an LLM) cost?
- Quality: does your model consistently deliver high-quality output after optimization?
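The first three goals can be put into numbers from a simple load-test log. The following is a minimal sketch, not a Baseten API: `RequestSample`, `summarize`, and the inputs `window_s` and `gpu_cost_per_hour` are hypothetical names introduced for illustration, and it assumes a single GPU billed by the hour and throughput measured in output tokens.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One completed request from a load test (hypothetical record type)."""
    latency_s: float    # wall-clock seconds from request sent to last token received
    output_tokens: int  # number of tokens the model generated for this request


def summarize(samples, window_s, gpu_cost_per_hour):
    """Compute mean latency, throughput, and cost per 1M tokens.

    window_s is the length of the measurement window in seconds;
    gpu_cost_per_hour is an assumed hourly price for the deployment.
    """
    if not samples or window_s <= 0:
        raise ValueError("need at least one sample and a positive window")

    total_tokens = sum(s.output_tokens for s in samples)
    if total_tokens == 0:
        raise ValueError("no tokens were generated in the window")

    tokens_per_second = total_tokens / window_s
    cost_per_second = gpu_cost_per_hour / 3600

    return {
        # Latency: average time each user waited for a full response.
        "mean_latency_s": sum(s.latency_s for s in samples) / len(samples),
        # Throughput: work completed per second across all users.
        "requests_per_second": len(samples) / window_s,
        "tokens_per_second": tokens_per_second,
        # Cost: price of a standardized unit of work (1M output tokens).
        "cost_per_million_tokens": cost_per_second / tokens_per_second * 1_000_000,
    }
```

Quality is deliberately left out of this sketch: it depends on task-specific evaluation rather than on timing data, so it has to be checked separately after each optimization.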
Model performance tooling
TensorRT-LLM Engine Builder
Baseten’s TensorRT-LLM engine builder simplifies and automates the process of using TensorRT-LLM for development and production.
Full-stack model performance
Model and GPU selection
Two of the highest-impact choices for model performance come before the optimization process: picking the best model size and implementation, and picking the right GPU to run it on.
GPU-level optimizations
Our first goal is to get the best possible performance out of a single GPU or GPU cluster.
Infrastructure-level optimizations
Once we have squeezed as much throughput (TPS) as possible out of the GPU, we scale out horizontally with infrastructure-level optimizations.
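As a rough illustration of horizontal scaling, the number of replicas can be estimated from the per-replica throughput measured after GPU-level optimization. This is a back-of-the-envelope sketch under stated assumptions: TPS is taken to mean tokens per second, throughput is assumed to scale linearly with replica count, and `replicas_needed` and its `headroom` parameter are hypothetical names, not part of any Baseten tooling.

```python
import math


def replicas_needed(target_tps, per_replica_tps, headroom=0.25):
    """Estimate how many replicas are required to serve target_tps.

    headroom is the fraction of each replica's capacity kept in reserve
    for traffic spikes (an assumed planning parameter).
    """
    if per_replica_tps <= 0:
        raise ValueError("per_replica_tps must be positive")
    if target_tps < 0:
        raise ValueError("target_tps must not be negative")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    # Capacity each replica can serve while leaving the reserve untouched.
    usable_tps = per_replica_tps * (1 - headroom)

    # Round up, and always keep at least one replica running.
    return max(1, math.ceil(target_tps / usable_tps))
```

For example, `replicas_needed(9000, 2000, headroom=0.25)` treats each replica as good for 1,500 tokens per second and returns 6. Real deployments rarely scale perfectly linearly, so an estimate like this is a starting point for autoscaling settings rather than a guarantee.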
Application-level optimizations
There are also application-level steps that you can take to make sure you’re getting the most value from your optimized endpoint.