Improve your latency and throughput
Model performance means optimizing every layer of your model serving infrastructure to balance four goals: latency, throughput, cost, and quality.
Baseten’s TensorRT-LLM engine builder simplifies and automates the process of using TensorRT-LLM for development and production.
Two of the highest-impact choices for model performance come before the optimization process: picking the best model size and implementation and picking the right GPU to run it on.
Model selection
Tradeoff: Latency/Throughput/Cost vs Quality
The biggest factor in your latency, throughput, cost, and quality is the model you use. Before you jump into optimizing a foundation model, consider whether a smaller model, an alternative model, or a different implementation of the same model could meet your needs.
Usually, model selection is bound by quality. For example, SDXL Lightning makes images incredibly quickly, but they may not be detailed enough for your use case.
Experiment with alternative models to see if they can reset your performance expectations while meeting your quality bar.
GPU selection
Tradeoff: Latency/Throughput vs Cost
The minimum requirement for a GPU instance is that it must have enough VRAM to load model weights with headroom left for inference.
It often makes sense to use a more powerful (but more expensive) GPU than the minimum requirement, especially if you have ambitious latency goals and/or high utilization.
For example, you might choose an instance with a more powerful GPU to hit a tight latency target, or the smallest instance that fits your model to keep costs down.
The GPU instance reference lists all available options.
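As a rough rule of thumb, model weights alone take about two bytes per parameter in fp16, before counting activations and KV cache. The sketch below is a back-of-the-envelope estimate with illustrative model sizes, not an exact sizing tool.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM needed just to hold model weights (fp16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param

# Illustrative sizes; real requirements also depend on batch size, sequence length, and KV cache.
for params_b in (7, 13, 70):
    fp16_gb = weight_vram_gb(params_b, 2.0)  # fp16/bf16 weights
    fp8_gb = weight_vram_gb(params_b, 1.0)   # fp8/int8 weights after quantization
    print(f"{params_b}B params: ~{fp16_gb:.0f} GB fp16, ~{fp8_gb:.0f} GB fp8/int8, plus inference headroom")
```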
Our first goal is to get the best possible performance out of a single GPU or GPU cluster.
Inference engine
Benefit: Latency/Throughput/Cost
You can just use transformers and PyTorch out of the box to serve your model. But best-in-class performance comes from using a dedicated inference engine, like TensorRT/TensorRT-LLM, vLLM, or TGI.
We often recommend TensorRT/TensorRT-LLM for best performance. The easiest way to get started with TensorRT-LLM is our TRT-LLM engine builder.
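For context, here is roughly what the out-of-the-box path looks like with transformers; a dedicated engine replaces this generation loop with optimized kernels and batching. The model ID below is just an example, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model; substitute the checkpoint you actually serve.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain continuous batching in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```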
Inference server
Benefit: Latency/Throughput
In addition to an optimized inference engine, you need an inference server to handle requests and supply features like in-flight batching.
Baseten runs a modified version of Triton for compatible model deployments. Other models use TrussServer, a capable general-purpose model inference server built into Truss.
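If you’re writing your own Truss, the skeleton below sketches the interface TrussServer drives: load() runs once at startup and predict() runs per request. This is a minimal outline, not a complete server; check the Truss documentation for the current signatures.

```python
# model/model.py inside a Truss
class Model:
    def __init__(self, **kwargs):
        # Truss passes configuration (e.g., secrets) in via kwargs.
        self._model = None

    def load(self):
        # Called once when the replica starts: load weights, warm up caches.
        self._model = ...  # e.g., initialize your inference engine here

    def predict(self, model_input):
        # Called for every request: run inference and return a JSON-serializable result.
        return {"output": str(model_input)}
```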
Quantization
Tradeoff: Latency/Throughput/Cost vs Quality
By default, model inference happens in fp16, meaning that model weights and other values are represented as 16-bit floating-point numbers.
Through a process called post-training quantization, you can instead run inference in a different format, like fp8, int8, or int4. This has massive benefits: more teraFLOPS at lower precision means lower latency, smaller numbers retrieved from VRAM means higher throughput, and smaller model weights mean saving on cost and potentially using fewer GPUs.
However, quantization can affect output quality. Thoroughly review quantized model outputs, both by hand and with standard checks like perplexity, to ensure they match the quality of the original model.
We’ve had a lot of success with fp8 for faster inference without quality loss and encourage experimenting with quantization, especially when using the TRT-LLM engine builder.
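One lightweight check is to compute perplexity of the original and quantized models on the same held-out text and confirm the numbers are close. The sketch below uses transformers with hypothetical model IDs and a hypothetical eval file; how you actually load a quantized checkpoint depends on the quantization method and serving stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    """Perplexity of a causal LM on `text` -- lower is better."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean next-token loss
    return torch.exp(loss).item()

eval_text = open("holdout.txt").read()  # representative text from your own domain

# Hypothetical model IDs for the original and quantized checkpoints.
print("fp16:", perplexity("my-org/my-model-fp16", eval_text))
print("fp8: ", perplexity("my-org/my-model-fp8", eval_text))
```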
Model-level optimizations
Tradeoff: Latency/Throughput/Cost vs Quality
There are a number of exciting cutting-edge techniques for model inference that can massively improve latency and/or throughput for a model. For example, LLMs can use speculative decoding or Medusa to generate multiple tokens per forward pass, improving tokens per second (TPS).
When using a new technique to improve model performance, always run real-world benchmarks and carefully validate output quality to ensure the performance improvements aren’t undermining the model’s usefulness.
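As one concrete example, transformers supports a simple form of speculative decoding (“assisted generation”): a small draft model proposes tokens that the large target model then verifies. The model IDs below are illustrative (both models must share a tokenizer), and production engines like TensorRT-LLM implement speculative decoding with their own mechanisms.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative pairing: a large target model and a small draft model from the same family.
target_id = "meta-llama/Llama-3.1-8B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Summarize the benefits of speculative decoding.", return_tensors="pt").to(target.device)
# assistant_model enables assisted generation: the draft proposes tokens, the target verifies them.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```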
Batching (GPU concurrency)
Tradeoff: Latency vs Throughput/Cost
Batch size is how many requests are processed concurrently on the GPU. It is a direct tradeoff between latency and throughput: larger batches increase total throughput and cost efficiency but add latency to each individual request, while smaller batches minimize per-request latency but leave GPU capacity on the table.
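Inference engines and servers handle this for you (TensorRT-LLM does in-flight batching at the token level), but a toy request-level batcher makes the tradeoff concrete: each request waits up to a short window so it can share a GPU forward pass with others. The names, timings, and fake model call below are illustrative.

```python
import asyncio

MAX_BATCH_SIZE = 8       # bigger batches: higher total throughput, higher per-request latency
MAX_WAIT_SECONDS = 0.01  # how long a request may wait for batch-mates

queue: asyncio.Queue = asyncio.Queue()

async def fake_model_batch(prompts):
    # Stand-in for one batched GPU forward pass.
    await asyncio.sleep(0.05)
    return [f"response to: {p}" for p in prompts]

async def batching_loop():
    """Collect queued requests into batches and run them together."""
    while True:
        prompt, future = await queue.get()
        batch = [(prompt, future)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = await fake_model_batch([p for p, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def handle_request(prompt: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    asyncio.create_task(batching_loop())
    results = await asyncio.gather(*(handle_request(f"prompt {i}") for i in range(20)))
    print(f"served {len(results)} requests in batches of up to {MAX_BATCH_SIZE}")

asyncio.run(main())
```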
Once we squeeze as much TPS as possible out of the GPU, we scale that out horizontally with infrastructure optimization.
Autoscaling
Tradeoff: Latency/Throughput vs Cost
If traffic to a deployment is high enough, even a well-optimized model server won’t be able to keep up on its own. By automatically creating additional replicas as traffic increases, you keep latency consistent for all users while only paying for the compute you need.
Learn more about autoscaling model replicas.
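A rough capacity estimate (based on Little’s law) can help you pick sensible minimum and maximum replica counts. This is a sizing heuristic with illustrative numbers, not how Baseten’s autoscaler works internally.

```python
import math

peak_rps = 50                  # illustrative peak requests per second
avg_latency_s = 2.0            # illustrative average request latency
per_replica_concurrency = 16   # requests one replica handles at a time

# Little's law: requests in flight = arrival rate x time in system.
in_flight = peak_rps * avg_latency_s
replicas_at_peak = math.ceil(in_flight / per_replica_concurrency)
print(f"~{in_flight:.0f} concurrent requests -> ~{replicas_at_peak} replicas at peak")
```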
Replica-level concurrency
Tradeoff: Latency vs Throughput/Cost
Replica-level concurrency sets the number of requests that can be sent to the model server at one time. This is different from on-GPU concurrency because your model server may also perform pre- and post-processing tasks on the CPU alongside GPU inference.
Replica-level concurrency should always be greater than or equal to on-device concurrency (batch size).
Network latency
Tradeoff: Latency vs Cost
If your GPU is in us-east-1 and your customer is in Australia, it doesn’t matter how much you’ve optimized TTFT — your real-world latency will be terrible.
Region-specific deployments are available on a per-customer basis. Contact us at support@baseten.co to discuss your needs.
There are also application-level steps that you can take to make sure you’re getting the most value from your optimized endpoint.
Good prompts
Benefits: Latency, Quality
Every token an LLM doesn’t have to process or generate is a token that you don’t have to wait for or pay for.
Prompt engineering can be as simple as saying “be concise” or as complex as making sure your RAG system returns the minimum number of highly-relevant retrievals.
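A quick way to make “fewer tokens” concrete is to count them with your model’s tokenizer before and after trimming a prompt; the tokenizer ID and prompts below are just examples.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example tokenizer

prompts = {
    "verbose": (
        "You are a helpful assistant. Please think carefully and answer the question "
        "below in a detailed, thorough, and comprehensive manner, covering every "
        "relevant aspect: What is the capital of France?"
    ),
    "concise": "Be concise. What is the capital of France?",
}

for name, prompt in prompts.items():
    num_tokens = len(tokenizer(prompt)["input_ids"])
    print(f"{name}: {num_tokens} input tokens")  # every input token adds prefill time and cost
```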
Consistent sequence shapes
Benefits: Latency, Throughput
When using TensorRT-LLM, make sure that your input and output sequences are a consistent length. The inference engine is built for a specific number of tokens, and going outside of those sequence shapes will hurt performance.
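One way to stay inside the engine’s built shapes is to truncate inputs and cap output length on the client or preprocessing side. The limits and tokenizer below are illustrative and should match the values the engine was actually built with.

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 2048   # illustrative: match the engine's build-time input limit
MAX_OUTPUT_TOKENS = 512   # illustrative: match the engine's build-time output limit

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example tokenizer

def build_request(prompt: str) -> dict:
    # Truncate the prompt so it never exceeds the sequence length the engine expects.
    input_ids = tokenizer(prompt, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return {
        "prompt": tokenizer.decode(input_ids, skip_special_tokens=True),
        "max_tokens": MAX_OUTPUT_TOKENS,
    }
```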
Chains for multi-step inference
Benefits: Latency, Cost
The only thing running on your GPU should be the AI model. Other tasks like retrievals, secondary models, and business logic should be deployed and scaled separately to avoid bottlenecks.
Use Chains for performant multi-step and multi-model inference.
Session reuse during inference
Benefit: Latency
Use sessions rather than individual requests to avoid unnecessary network latency. See inference documentation for details.
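A minimal sketch with Python’s requests library: a Session reuses the underlying TCP/TLS connection across calls, so you skip a fresh handshake on every request. The URL and payload shape are illustrative placeholders; use your deployment’s actual endpoint.

```python
import os
import requests

URL = "https://model-XXXXXXX.api.baseten.co/production/predict"  # illustrative endpoint
HEADERS = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

session = requests.Session()  # keeps the TCP/TLS connection open between requests

for prompt in ["First prompt", "Second prompt", "Third prompt"]:
    response = session.post(URL, headers=HEADERS, json={"prompt": prompt}, timeout=60)
    response.raise_for_status()
    print(response.json())
```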