Improve your latency and throughput
Model selection
GPU selection
Inference engine

You can use transformers and pytorch out of the box to serve your model. But best-in-class performance comes from using a dedicated inference engine like TensorRT-LLM.
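For reference, here is a minimal sketch of that out-of-the-box path with transformers and pytorch; the model name and generation settings are placeholder assumptions, not recommendations:

```python
# Out-of-the-box serving path: plain transformers + pytorch, no dedicated engine.
# The model name and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # fp16 is the common serving default (see Quantization below)
    device_map="auto",          # place weights on the available GPU(s)
)

prompt = "Explain the difference between latency and throughput."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

A dedicated engine replaces this plain generation loop with optimized kernels and batching, which is where most of the latency and throughput gains come from.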
Inference server

Models are served by TrussServer, a capable general-purpose model inference server built into Truss.
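As a rough sketch of what TrussServer serves, here is a minimal Truss model class with load() and predict() methods; the model choice and input format are illustrative assumptions:

```python
# model/model.py — minimal sketch of a Truss model served by TrussServer.
# Model choice and request/response shapes are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed example model


class Model:
    def __init__(self, **kwargs):
        # TrussServer passes configuration, secrets, etc. as keyword arguments.
        self._tokenizer = None
        self._model = None

    def load(self):
        # Runs once per replica at startup, before any requests are served.
        self._tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self._model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )

    def predict(self, model_input: dict) -> dict:
        # Runs per request; model_input is the parsed JSON request body.
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        with torch.no_grad():
            output = self._model.generate(
                **inputs, max_new_tokens=model_input.get("max_new_tokens", 256)
            )
        return {"output": self._tokenizer.decode(output[0], skip_special_tokens=True)}
```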
Quantization

By default, models run inference in fp16, meaning that model weights and other values are represented as 16-bit floating-point numbers. Through a process called post-training quantization, you can instead run inference in a different format, like fp8, int8, or int4. This has massive benefits: more teraFLOPS at lower precision means lower latency, smaller numbers retrieved from VRAM mean higher throughput, and smaller model weights mean saving on cost and potentially using fewer GPUs.

However, quantization can affect output quality. Thoroughly review quantized model outputs, both by hand and with standard checks like perplexity, to ensure that the output of the quantized model matches the original.

We’ve had a lot of success with fp8 for faster inference without quality loss, and we encourage experimenting with quantization, especially when using the TRT-LLM engine builder.
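One way to run that kind of check: the sketch below compares perplexity between the original fp16 weights and a post-training-quantized copy. It uses int4 quantization via bitsandbytes in transformers as a stand-in (fp8 through the TRT-LLM engine builder takes a different path), and the model name, evaluation text, and acceptable gap are all assumptions:

```python
# Compare perplexity of fp16 weights vs. a 4-bit post-training-quantized copy.
# Model name and evaluation text are illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed example model
EVAL_TEXT = "The quick brown fox jumps over the lazy dog. " * 50  # stand-in eval corpus

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def perplexity(model, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Using the inputs as labels gives the causal-LM cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


# Original fp16 weights.
fp16_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
ppl_fp16 = perplexity(fp16_model, EVAL_TEXT)
del fp16_model
torch.cuda.empty_cache()

# Post-training quantization to int4 via bitsandbytes.
quantized_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
ppl_int4 = perplexity(quantized_model, EVAL_TEXT)

print(f"fp16 perplexity: {ppl_fp16:.2f}")
print(f"int4 perplexity: {ppl_int4:.2f}")
```

If the two perplexities diverge noticeably, review the quantized model's outputs by hand before serving it.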
Model-level optimizations
Batching (GPU concurrency)
Autoscaling
Replica-level concurrency
Network latency
Good prompts
Consistent sequence shapes
Chains for multi-step inference
Session reuse during inference