Improve your latency and throughput
Model selection
GPU selection
Inference engine

You can use transformers and pytorch out of the box to serve your model. But best-in-class performance comes from using a dedicated inference engine like TensorRT-LLM.
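For reference, here is a minimal sketch of that out-of-the-box path with transformers and pytorch; the model name and generation settings are placeholder assumptions, not recommendations:

```python
# Out-of-the-box serving path: plain transformers + pytorch, no dedicated engine.
# The model name and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # fp16 is the common serving default (see Quantization below)
    device_map="auto",          # place weights on the available GPU(s)
)

prompt = "Explain the difference between latency and throughput."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

A dedicated engine replaces this plain generation loop with optimized kernels and batching, which is where most of the latency and throughput gains come from.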
Inference server

Models are served by TrussServer, a capable general-purpose model inference server built into Truss.
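As a rough sketch of what TrussServer serves, here is a minimal Truss model class with load() and predict() methods; the model choice and input format are illustrative assumptions:

```python
# model/model.py — minimal sketch of a Truss model served by TrussServer.
# Model choice and request/response shapes are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed example model


class Model:
    def __init__(self, **kwargs):
        # TrussServer passes configuration, secrets, etc. as keyword arguments.
        self._tokenizer = None
        self._model = None

    def load(self):
        # Runs once per replica at startup, before any requests are served.
        self._tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self._model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )

    def predict(self, model_input: dict) -> dict:
        # Runs per request; model_input is the parsed JSON request body.
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        with torch.no_grad():
            output = self._model.generate(
                **inputs, max_new_tokens=model_input.get("max_new_tokens", 256)
            )
        return {"output": self._tokenizer.decode(output[0], skip_special_tokens=True)}
```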
Quantization

By default, models run inference in fp16, meaning that model weights and other values are represented as 16-bit floating-point numbers. Through a process called post-training quantization, you can instead run inference in a different format, like fp8, int8, or int4. This has massive benefits: more teraFLOPS at lower precision means lower latency, smaller numbers retrieved from VRAM mean higher throughput, and smaller model weights mean saving on cost and potentially using fewer GPUs.

However, quantization can affect output quality. Thoroughly review quantized model outputs, both by hand and with standard checks like perplexity, to ensure that the output of the quantized model matches the original.

We’ve had a lot of success with fp8 for faster inference without quality loss, and we encourage experimenting with quantization, especially when using the TRT-LLM engine builder.
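One way to run that kind of check: the sketch below compares perplexity between the original fp16 weights and a post-training-quantized copy. It uses int4 quantization via bitsandbytes in transformers as a stand-in (fp8 through the TRT-LLM engine builder takes a different path), and the model name, evaluation text, and acceptable gap are all assumptions:

```python
# Compare perplexity of fp16 weights vs. a 4-bit post-training-quantized copy.
# Model name and evaluation text are illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed example model
EVAL_TEXT = "The quick brown fox jumps over the lazy dog. " * 50  # stand-in eval corpus

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def perplexity(model, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Using the inputs as labels gives the causal-LM cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


# Original fp16 weights.
fp16_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
ppl_fp16 = perplexity(fp16_model, EVAL_TEXT)
del fp16_model
torch.cuda.empty_cache()

# Post-training quantization to int4 via bitsandbytes.
quantized_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
ppl_int4 = perplexity(quantized_model, EVAL_TEXT)

print(f"fp16 perplexity: {ppl_fp16:.2f}")
print(f"int4 perplexity: {ppl_int4:.2f}")
```

If the two perplexities diverge noticeably, review the quantized model's outputs by hand before serving it.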
Model-level optimizations
Batching (GPU concurrency)
Autoscaling
Replica-level concurrency
Network latency
Good prompts
Consistent sequence shapes
Chains for multi-step inference
Session reuse during inference