Whisper with TensorRT-LLM
Build an optimized inference engine for Whisper 3
This configuration builds an inference engine to serve Whisper 3 on an A10G GPU. A similar configuration can be used for any Whisper model, including fine-tuned variants.
Whisper is an audio transcription model, not a chat model. However, its architecture (an encoder-decoder transformer) is very similar to an LLM's, so it's supported by TensorRT-LLM.
Setup
See the end-to-end engine builder tutorial prerequisites for full setup instructions.
Configuration
Unlike the LLM examples, we get the Whisper weights directly from OpenAI, not Hugging Face. The `max_input_len` and `max_output_len` parameters control the optional text prompt passed to the model, not the audio file itself.
Deployment
Usage
`url`
A URL to a valid audio file (16 kHz, mono, WAV). For testing, try this ten-second file.
Audio files are limited to 30 seconds in length. For longer files, see building an audio transcription pipeline.
The model requires exactly one of `url` or `audio`.
`audio`
A base64-encoded string of a valid audio file (16 kHz, mono, WAV).
Audio files are limited to 30 seconds in length. For longer files, see building an audio transcription pipeline.
The model requires exactly one of `audio` or `url`.
An optional input text prompt to guide the model's generation.