Improve your latency and throughput
Model performance means optimizing every layer of your model serving infrastructure to balance four goals: latency, throughput, cost, and quality.
Baseten’s TensorRT-LLM engine builder simplifies and automates the process of using TensorRT-LLM for development and production.
Two of the highest-impact choices for model performance come before the optimization process: picking the best model size and implementation and picking the right GPU to run it on.
Model selection
Tradeoff: Latency/Throughput/Cost vs Quality
The biggest factor in your latency, throughput, cost, and quality is the model you use. Before you jump into optimizing a foundation model, consider whether a smaller model, an alternative model, or a different implementation of the same model could meet your needs.
Usually, model selection is bound by quality. For example, SDXL Lightning makes images incredibly quickly, but they may not be detailed enough for your use case.
Experiment with alternative models to see if they can reset your performance expectations while meeting your quality bar.
GPU selection
Tradeoff: Latency/Throughput vs Cost
The minimum requirement for a GPU instance is that it must have enough VRAM to load model weights with headroom left for inference.
It often makes sense to use a more powerful (but more expensive) GPU than the minimum requirement, especially if you have ambitious latency goals and/or high utilization.
For example, you might choose an instance with a more powerful GPU to hit a tight latency target, or the smallest instance that fits your model to keep costs down.
The GPU instance reference lists all available options.
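As a rough rule of thumb, model weights alone take about two bytes per parameter in fp16, before counting activations and KV cache. The sketch below is a back-of-the-envelope estimate with illustrative model sizes, not an exact sizing tool.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM needed just to hold model weights (fp16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param

# Illustrative sizes; real requirements also depend on batch size, sequence length, and KV cache.
for params_b in (7, 13, 70):
    fp16_gb = weight_vram_gb(params_b, 2.0)  # fp16/bf16 weights
    fp8_gb = weight_vram_gb(params_b, 1.0)   # fp8/int8 weights after quantization
    print(f"{params_b}B params: ~{fp16_gb:.0f} GB fp16, ~{fp8_gb:.0f} GB fp8/int8, plus inference headroom")
```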
Our first goal is to get the best possible performance out of a single GPU or GPU cluster.
Inference engine
Benefit: Latency/Throughput/Cost
You can just use transformers and PyTorch out of the box to serve your model. But best-in-class performance comes from using a dedicated inference engine, like TensorRT/TensorRT-LLM, vLLM, or TGI.
We often recommend TensorRT/TensorRT-LLM for best performance. The easiest way to get started with TensorRT-LLM is our TRT-LLM engine builder.
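For context, here is roughly what the out-of-the-box path looks like with transformers; a dedicated engine replaces this generation loop with optimized kernels and batching. The model ID below is just an example, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model; substitute the checkpoint you actually serve.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain continuous batching in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```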
Inference server
Benefit: Latency/Throughput
In addition to an optimized inference engine, you need an inference server to handle requests and supply features like in-flight batching.
Baseten runs a modified version of Triton for compatible model deployments. Other models use TrussServer, a capable general-purpose model inference server built into Truss.
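If you’re writing your own Truss, the skeleton below sketches the interface TrussServer drives: load() runs once at startup and predict() runs per request. This is a minimal outline, not a complete server; check the Truss documentation for the current signatures.

```python
# model/model.py inside a Truss
class Model:
    def __init__(self, **kwargs):
        # Truss passes configuration (e.g., secrets) in via kwargs.
        self._model = None

    def load(self):
        # Called once when the replica starts: load weights, warm up caches.
        self._model = ...  # e.g., initialize your inference engine here

    def predict(self, model_input):
        # Called for every request: run inference and return a JSON-serializable result.
        return {"output": str(model_input)}
```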
Quantization
Tradeoff: Latency/Throughput/Cost vs Quality
By default, model inference happens in fp16, meaning that model weights and other values are represented as 16-bit floating-point numbers.
Through a process called post-training quantization, you can instead run inference in a different format, like fp8, int8, or int4. This has massive benefits: more teraFLOPS at lower precision means lower latency, smaller numbers retrieved from VRAM means higher throughput, and smaller model weights mean saving on cost and potentially using fewer GPUs.
However, quantization can affect output quality. Thoroughly review quantized model outputs, both by hand and with standard checks like perplexity, to ensure they match the quality of the original model.
We’ve had a lot of success with fp8 for faster inference without quality loss and encourage experimenting with quantization, especially when using the TRT-LLM engine builder.
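One lightweight check is to compute perplexity of the original and quantized models on the same held-out text and confirm the numbers are close. The sketch below uses transformers with hypothetical model IDs and a hypothetical eval file; how you actually load a quantized checkpoint depends on the quantization method and serving stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    """Perplexity of a causal LM on `text` -- lower is better."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean next-token loss
    return torch.exp(loss).item()

eval_text = open("holdout.txt").read()  # representative text from your own domain

# Hypothetical model IDs for the original and quantized checkpoints.
print("fp16:", perplexity("my-org/my-model-fp16", eval_text))
print("fp8: ", perplexity("my-org/my-model-fp8", eval_text))
```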
Model-level optimizations
Tradeoff: Latency/Throughput/Cost vs Quality
There are a number of exciting cutting-edge techniques for model inference that can massively improve latency and/or throughput for a model. For example, LLMs can use speculative decoding or Medusa to generate multiple tokens per forward pass, improving tokens per second (TPS).
When using a new technique to improve model performance, always run real-world benchmarks and carefully validate output quality to ensure the performance improvements aren’t undermining the model’s usefulness.
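As one concrete example, transformers supports a simple form of speculative decoding (“assisted generation”): a small draft model proposes tokens that the large target model then verifies. The model IDs below are illustrative (both models must share a tokenizer), and production engines like TensorRT-LLM implement speculative decoding with their own mechanisms.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative pairing: a large target model and a small draft model from the same family.
target_id = "meta-llama/Llama-3.1-8B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Summarize the benefits of speculative decoding.", return_tensors="pt").to(target.device)
# assistant_model enables assisted generation: the draft proposes tokens, the target verifies them.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```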
Batching (GPU concurrency)
Tradeoff: Latency vs Throughput/Cost
Batch size is how many requests are processed concurrently on the GPU. It is a direct tradeoff between latency and throughput: larger batches increase total throughput and cost efficiency but add latency to each individual request, while smaller batches minimize per-request latency but leave GPU capacity on the table.
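Inference engines and servers handle this for you (TensorRT-LLM does in-flight batching at the token level), but a toy request-level batcher makes the tradeoff concrete: each request waits up to a short window so it can share a GPU forward pass with others. The names, timings, and fake model call below are illustrative.

```python
import asyncio

MAX_BATCH_SIZE = 8       # bigger batches: higher total throughput, higher per-request latency
MAX_WAIT_SECONDS = 0.01  # how long a request may wait for batch-mates

queue: asyncio.Queue = asyncio.Queue()

async def fake_model_batch(prompts):
    # Stand-in for one batched GPU forward pass.
    await asyncio.sleep(0.05)
    return [f"response to: {p}" for p in prompts]

async def batching_loop():
    """Collect queued requests into batches and run them together."""
    while True:
        prompt, future = await queue.get()
        batch = [(prompt, future)]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = await fake_model_batch([p for p, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def handle_request(prompt: str) -> str:
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main():
    asyncio.create_task(batching_loop())
    results = await asyncio.gather(*(handle_request(f"prompt {i}") for i in range(20)))
    print(f"served {len(results)} requests in batches of up to {MAX_BATCH_SIZE}")

asyncio.run(main())
```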
Once we squeeze as much TPS as possible out of the GPU, we scale that out horizontally with infrastructure optimization.
Autoscaling
Tradeoff: Latency/Throughput vs Cost
If traffic to a deployment is high enough, even a well-optimized model server won’t be able to keep up on its own. By automatically creating additional replicas as traffic increases, you keep latency consistent for all users while only paying for the compute you need.
Learn more about autoscaling model replicas.
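A rough capacity estimate (based on Little’s law) can help you pick sensible minimum and maximum replica counts. This is a sizing heuristic with illustrative numbers, not how Baseten’s autoscaler works internally.

```python
import math

peak_rps = 50                  # illustrative peak requests per second
avg_latency_s = 2.0            # illustrative average request latency
per_replica_concurrency = 16   # requests one replica handles at a time

# Little's law: requests in flight = arrival rate x time in system.
in_flight = peak_rps * avg_latency_s
replicas_at_peak = math.ceil(in_flight / per_replica_concurrency)
print(f"~{in_flight:.0f} concurrent requests -> ~{replicas_at_peak} replicas at peak")
```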
Replica-level concurrency
Tradeoff: Latency vs Throughput/Cost
Replica-level concurrency sets the number of requests that can be sent to the model server at one time. This is different from on-GPU concurrency because your model server may also perform pre- and post-processing tasks on the CPU alongside GPU inference.
Replica-level concurrency should always be greater than or equal to on-device concurrency (batch size).
Network latency
Tradeoff: Latency vs Cost
If your GPU is in us-east-1 and your customer is in Australia, it doesn’t matter how much you’ve optimized TTFT — your real-world latency will be terrible.
Region-specific deployments are available on a per-customer basis. Contact us at support@baseten.co to discuss your needs.
There are also application-level steps that you can take to make sure you’re getting the most value from your optimized endpoint.
Good prompts
Benefits: Latency, Quality
Every token an LLM doesn’t have to process or generate is a token that you don’t have to wait for or pay for.
Prompt engineering can be as simple as saying “be concise” or as complex as making sure your RAG system returns the minimum number of highly-relevant retrievals.
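A quick way to make “fewer tokens” concrete is to count them with your model’s tokenizer before and after trimming a prompt; the tokenizer ID and prompts below are just examples.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example tokenizer

prompts = {
    "verbose": (
        "You are a helpful assistant. Please think carefully and answer the question "
        "below in a detailed, thorough, and comprehensive manner, covering every "
        "relevant aspect: What is the capital of France?"
    ),
    "concise": "Be concise. What is the capital of France?",
}

for name, prompt in prompts.items():
    num_tokens = len(tokenizer(prompt)["input_ids"])
    print(f"{name}: {num_tokens} input tokens")  # every input token adds prefill time and cost
```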
Consistent sequence shapes
Benefits: Latency, Throughput
When using TensorRT-LLM, make sure that your input and output sequences are a consistent length. The inference engine is built for a specific number of tokens, and going outside of those sequence shapes will hurt performance.
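One way to stay inside the engine’s built shapes is to truncate inputs and cap output length on the client or preprocessing side. The limits and tokenizer below are illustrative and should match the values the engine was actually built with.

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 2048   # illustrative: match the engine's build-time input limit
MAX_OUTPUT_TOKENS = 512   # illustrative: match the engine's build-time output limit

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example tokenizer

def build_request(prompt: str) -> dict:
    # Truncate the prompt so it never exceeds the sequence length the engine expects.
    input_ids = tokenizer(prompt, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return {
        "prompt": tokenizer.decode(input_ids, skip_special_tokens=True),
        "max_tokens": MAX_OUTPUT_TOKENS,
    }
```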
Chains for multi-step inference
Benefits: Latency, Cost
The only thing running on your GPU should be the AI model. Other tasks like retrievals, secondary models, and business logic should be deployed and scaled separately to avoid bottlenecks.
Use Chains for performant multi-step and multi-model inference.
Session reuse during inference
Benefit: Latency
Use sessions rather than individual requests to avoid unnecessary network latency. See inference documentation for details.
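A minimal sketch with Python’s requests library: a Session reuses the underlying TCP/TLS connection across calls, so you skip a fresh handshake on every request. The URL and payload shape are illustrative placeholders; use your deployment’s actual endpoint.

```python
import os
import requests

URL = "https://model-XXXXXXX.api.baseten.co/production/predict"  # illustrative endpoint
HEADERS = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

session = requests.Session()  # keeps the TCP/TLS connection open between requests

for prompt in ["First prompt", "Second prompt", "Third prompt"]:
    response = session.post(URL, headers=HEADERS, json={"prompt": prompt}, timeout=60)
    response.raise_for_status()
    print(response.json())
```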