Deploy Faster Whisper V3

Faster Whisper V3 is an optimized version of OpenAI’s Whisper model, re-implemented using CTranslate2. It offers significantly faster inference speeds—often 4x faster or more—while maintaining the same accuracy as the original model. On Baseten, you can deploy Faster Whisper V3 as a custom Truss for ultra-fast, cost-effective transcription.

Faster Whisper V3 is a custom Python Truss that uses the faster-whisper library. This implementation is particularly well-suited for high-throughput workloads where latency and GPU utilization are critical.

Configuration

The config.yaml uses model_cache to pre-download the model weights into the deployment image, reducing cold start times. It also sets model_metadata.model_id, which model.py reads to know which weights to load.
model_name: faster-whisper-v3
python_version: py39
requirements:
  - torch==2.1.1
  - faster-whisper==1.0.3
  - ctranslate2==4.4.0
  - numpy==1.26.4
model_metadata:
  model_id: Systran/faster-whisper-large-v3
resources:
  accelerator: A10G
  cpu: 500m
  memory: 512Mi
  use_gpu: true
model_cache:
  - repo_id: Systran/faster-whisper-large-v3
    use_volume: false

Model implementation

The model.py handles both base64-encoded audio and remote URLs. It writes the audio bytes to a temporary file, flushes them to disk, and then passes the file path to the WhisperModel for transcription.
import base64
from tempfile import NamedTemporaryFile
from faster_whisper import WhisperModel
import requests

class Model:
    def __init__(self, **kwargs):
        self.model_id = kwargs["config"]["model_metadata"]["model_id"]
        self.model = None

    def load(self):
        # Load the CTranslate2-optimized model onto the GPU
        self.model = WhisperModel(self.model_id, device="cuda", compute_type="float16")

    def predict(self, request):
        # Support both a remote URL and base64-encoded audio
        if "url" in request:
            audio_data = requests.get(request["url"]).content
        elif "audio" in request:
            audio_data = base64.b64decode(request["audio"])
        else:
            raise ValueError("Request must include either 'url' or 'audio'")

        result_segments = []
        with NamedTemporaryFile() as fp:
            fp.write(audio_data)
            fp.flush()  # ensure all bytes are on disk before transcription
            segments, info = self.model.transcribe(fp.name)
            # segments is a generator; iterating it runs the transcription
            for seg in segments:
                result_segments.append({
                    "text": seg.text,
                    "start": seg.start,
                    "end": seg.end,
                })

        return {
            "language": info.language,
            "segments": result_segments,
        }
Deploy your Truss with:
truss push

Run inference

Faster Whisper V3 accepts either a direct URL to an audio file or a base64-encoded string of the audio data.
import baseten

model = baseten.deployed_model_version("{MODEL_VERSION_ID}")

# Call with a URL
response = model.predict({
    "url": "https://example.com/audio.mp3"
})

print(response["segments"])
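To call the model with base64-encoded audio instead of a URL, encode the raw file bytes first. A minimal sketch; the `encode_audio` helper and the local file path are illustrative, not part of any library:

```python
import base64

def encode_audio(path: str) -> str:
    # Read the raw audio bytes and base64-encode them for the "audio" field
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# The request payload then becomes:
# response = model.predict({"audio": encode_audio("audio.mp3")})
```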

Configuration and tuning

Faster Whisper is designed for maximum performance, but there are still trade-offs to consider for your specific use case.

CTranslate2 Compute Type

By default, faster-whisper uses float16 for inference on GPUs. You can further optimize performance by experimenting with different compute types (e.g., int8_float16) if your hardware supports it, which can reduce memory usage and potentially increase speed.
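One way to manage this is to validate a requested compute type before passing it to WhisperModel. The helper below is a hypothetical sketch, not part of faster-whisper; which types actually work depends on your GPU:

```python
# Common CTranslate2 compute types and their trade-offs (hypothetical helper,
# not part of faster-whisper; actual support depends on the GPU)
SUPPORTED_COMPUTE_TYPES = {
    "float16": "GPU default; full float16 precision",
    "int8_float16": "int8 weights with float16 compute; roughly half the memory",
    "int8": "smallest footprint; may cost a little accuracy",
}

def pick_compute_type(requested: str) -> str:
    # Fall back to the float16 default if the requested type is unknown
    return requested if requested in SUPPORTED_COMPUTE_TYPES else "float16"

# In load(), this would be used as:
# self.model = WhisperModel(self.model_id, device="cuda",
#                           compute_type=pick_compute_type("int8_float16"))
```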

Beam Size vs. Speed

The beam_size parameter in the transcribe method controls the trade-off between transcription quality and speed. A smaller beam size (e.g., 1 or 2) will be faster, while a larger beam size (e.g., 5) can improve accuracy for complex audio at the cost of higher latency.
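To measure this trade-off on your own audio, you can time the same clip at several beam sizes. A sketch, assuming `model` is an already-loaded WhisperModel; `compare_beam_sizes` is an illustrative helper, not a library function:

```python
import time

def compare_beam_sizes(model, audio_path, beam_sizes=(1, 2, 5)):
    # Transcribe the same clip at each beam size and record wall-clock time
    timings = {}
    for beam in beam_sizes:
        start = time.perf_counter()
        segments, _ = model.transcribe(audio_path, beam_size=beam)
        text = "".join(seg.text for seg in segments)  # exhaust the generator
        timings[beam] = (time.perf_counter() - start, text)
    return timings
```

Comparing the returned timings and transcripts shows whether a larger beam actually improves output enough on your audio to justify the added latency.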
  • Whisper V3 — The standard OpenAI implementation of Whisper.
  • Model APIs — Instant access to transcription without dedicated infrastructure.
  • Truss examples — Source code for this optimized Truss.