Deploy Faster Whisper V3
Faster Whisper V3 is a custom Python Truss that uses the faster-whisper library. This implementation is particularly well-suited for high-throughput workloads where latency and GPU utilization are critical.
Configuration
The config.yaml uses model_cache to ensure the model weights are pre-loaded into the deployment, reducing cold start times.
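As a sketch, a model_cache section in config.yaml might look like the following. The accelerator choice and the Hugging Face repo ID are assumptions for illustration; check the actual Truss source for the exact values.

```yaml
# Illustrative config.yaml fragment (values are assumptions, not the shipped config)
model_name: faster-whisper-v3
resources:
  accelerator: A10G
  use_gpu: true
model_cache:
  # Pre-downloads the weights into the image so cold starts skip the fetch
  - repo_id: Systran/faster-whisper-large-v3
requirements:
  - faster-whisper
```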
Model implementation
The model.py handles both base64-encoded audio and remote URLs. It uses a temporary file to store the audio data before passing it to the WhisperModel for transcription.
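A minimal sketch of this pattern is shown below. The request keys (`audio` for base64, `url` for a remote file) and the model size are assumptions; the actual model.py may differ in naming and error handling.

```python
import base64
import tempfile
import urllib.request


def write_audio_to_tempfile(request: dict) -> str:
    """Write base64 audio or a fetched remote URL to a temp file; return its path.

    The 'audio'/'url' key names are illustrative assumptions.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    if request.get("audio"):
        # Base64-encoded audio bytes sent inline in the request
        tmp.write(base64.b64decode(request["audio"]))
    elif request.get("url"):
        # Remote audio file fetched over HTTP(S)
        with urllib.request.urlopen(request["url"]) as resp:
            tmp.write(resp.read())
    else:
        raise ValueError("Provide either 'audio' (base64) or 'url'.")
    tmp.close()
    return tmp.name


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Imported at load time so the weights initialize once per replica
        from faster_whisper import WhisperModel

        self._model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    def predict(self, request: dict) -> dict:
        audio_path = write_audio_to_tempfile(request)
        segments, info = self._model.transcribe(audio_path, beam_size=5)
        return {
            "language": info.language,
            "text": " ".join(seg.text for seg in segments),
        }
```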
Run inference
Faster Whisper V3 accepts either a direct URL to an audio file or a base64-encoded string of the audio data.
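A hedged sketch of calling the deployed model over HTTP is shown below. The endpoint URL, API key header, and request key (`url`) are placeholders based on the request format described above; substitute your own deployment details.

```python
import json
import urllib.request

# Hypothetical endpoint and key -- replace with your deployment's values
API_URL = "https://model-XXXXXXX.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"


def build_payload(audio_url: str) -> dict:
    # The model also accepts base64 audio, e.g. {"audio": "<base64 string>"}
    return {"url": audio_url}


def transcribe(audio_url: str) -> dict:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(audio_url)).encode(),
        headers={
            "Authorization": f"Api-Key {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```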
Configuration and tuning
Faster Whisper is designed for maximum performance, but there are still trade-offs to consider for your specific use case.
CTranslate2 Compute Type
By default, faster-whisper uses float16 for inference on GPUs. You can further optimize performance by experimenting with different compute types (e.g., int8_float16) if your hardware supports it, which can reduce memory usage and potentially increase speed.
Beam Size vs. Speed
The beam_size parameter in the transcribe method controls the trade-off between transcription quality and speed. A smaller beam size (e.g., 1 or 2) will be faster, while a larger beam size (e.g., 5) can improve accuracy for complex audio at the cost of higher latency.
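Both tuning knobs can be sketched together as follows. The `compute_type` and `beam_size` arguments are part of the faster-whisper API; the latency flag and helper function are illustrative assumptions.

```python
def pick_decoding_params(low_latency: bool) -> dict:
    """Map a latency preference to a beam size (illustrative heuristic)."""
    # beam_size=1 favors speed; beam_size=5 favors accuracy on complex audio
    return {"beam_size": 1 if low_latency else 5}


def transcribe_tuned(audio_path: str, low_latency: bool = True) -> str:
    # Imported inside the function so the sketch reads without the library installed
    from faster_whisper import WhisperModel

    # int8_float16 reduces memory and may improve throughput on supporting GPUs
    model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
    segments, _info = model.transcribe(audio_path, **pick_decoding_params(low_latency))
    return " ".join(seg.text for seg in segments)
```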
Related
- Whisper V3 — The standard OpenAI implementation of Whisper.
- Model APIs — Instant access to transcription without dedicated infrastructure.
- Truss examples — Source code for this optimized Truss.