Deploy Whisper V3

Whisper V3 is a general-purpose speech recognition model trained on a large, diverse audio dataset. It excels at transcription, translation, and language identification. On Baseten, you can deploy Whisper V3 as a custom Truss that handles audio processing and model inference on dedicated GPU hardware.

Deploy Whisper V3

Unlike LLMs that use pre-built engines, Whisper V3 is typically deployed as a custom Python Truss. This allows you to include preprocessing logic, such as using ffmpeg to handle various audio formats.
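As a reference, a minimal Whisper Truss might be laid out as follows (the directory name is arbitrary; config.yaml at the root and model/model.py are the standard Truss conventions):

```
whisper-v3/
├── config.yaml      # hardware, dependencies, and weight downloads
└── model/
    └── model.py     # Model class with load() and predict()
```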

Configuration

The config.yaml specifies the hardware requirements and the system packages needed for audio processing. Whisper V3 runs efficiently on an A10G or L4 GPU.
model_name: whisper-v3
python_version: py310
requirements:
  - torch==2.4.1
  - openai-whisper==20250625
  - ffmpeg-python==0.2.0
system_packages:
  - ffmpeg
resources:
  accelerator: A10G
  cpu: '3'
  memory: 16Gi
  use_gpu: true
external_data:
  - local_data_path: weights/large-v3.pt
    url: https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt

Model implementation

The model.py file defines the Model class, which loads the Whisper weights and handles inference requests. We use ffmpeg to convert incoming audio URLs into the 16 kHz mono waveform that Whisper expects.
import whisper
import ffmpeg
import numpy as np
import torch
from pathlib import Path

class Model:
    def __init__(self, **kwargs):
        self._data_dir = kwargs["data_dir"]
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load(self):
        # Load the model from the weights downloaded via external_data
        self.model = whisper.load_model(
            str(Path(self._data_dir) / "weights" / "large-v3.pt"),
            self.device,
        )

    def predict(self, request):
        url = request["url"]
        # Download and decode the audio with ffmpeg, resampling
        # to 16 kHz mono 16-bit PCM
        out, _ = (
            ffmpeg.input(url)
            .output("pipe:", format="wav", acodec="pcm_s16le", ac=1, ar=16000)
            .run(capture_stdout=True, capture_stderr=True)
        )
        # Scale int16 samples into the float32 range [-1.0, 1.0]
        waveform = np.frombuffer(out, dtype=np.int16).astype(np.float32) / 32768.0

        # Run inference on the decoded waveform
        result = self.model.transcribe(waveform)

        return {
            "text": result["text"],
            "language": result["language"],
            "segments": result["segments"]
        }
Deploy your Truss with:
truss push

Run inference

Because this is a custom Python model, you use the predict endpoint to send an audio URL and receive the transcription.
import baseten

# Deployments of custom models use the predict method
model = baseten.deployed_model_version("{MODEL_VERSION_ID}")

response = model.predict({
    "url": "https://example.com/audio.mp3"
})

print(response["text"])
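You can also call the model's REST endpoint directly with any HTTP client. A minimal sketch of assembling such a request (the model ID, the model-{id}.api.baseten.co hostname pattern, the BASETEN_API_KEY variable, and the build_predict_request helper are illustrative assumptions, not part of the Truss above):

```python
import os

def build_predict_request(model_id: str, audio_url: str):
    """Assemble the endpoint URL, headers, and JSON body for a predict call."""
    endpoint = f"https://model-{model_id}.api.baseten.co/production/predict"
    headers = {"Authorization": f"Api-Key {os.environ.get('BASETEN_API_KEY', '')}"}
    payload = {"url": audio_url}
    return endpoint, headers, payload

endpoint, headers, payload = build_predict_request(
    "abcd1234", "https://example.com/audio.mp3"
)
# Send with your HTTP client of choice, e.g.:
# response = requests.post(endpoint, headers=headers, json=payload)
# print(response.json()["text"])
```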

Configuration and tuning

Whisper V3 is highly versatile but can be resource-intensive for long audio files.

Latency vs. Cost

For real-time applications, you may want to use a smaller variant like medium or small if the accuracy trade-off is acceptable. These variants run significantly faster and can be deployed on smaller, cheaper GPUs like the T4.
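As a sketch, switching to a smaller variant mostly means changing the resources block and the weights (the exact CPU and memory values below are illustrative; use the official checkpoint URL for the variant you choose):

```yaml
resources:
  accelerator: T4
  cpu: '3'
  memory: 16Gi
  use_gpu: true
# In external_data, point local_data_path and url at the medium
# checkpoint instead of large-v3, and update the filename that
# load() passes to whisper.load_model accordingly.
```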

Preprocessing

Using ffmpeg inside the predict method allows your model to handle almost any audio or video format automatically. However, for extremely high-volume workloads, you might consider moving audio preprocessing to a separate service or a Chain to avoid blocking the GPU during the conversion phase.
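As a sketch of that separation, the decoding step can be factored into a standalone function that a CPU-only service could run, returning the float32 waveform for the GPU service to transcribe. This version invokes the ffmpeg CLI via subprocess rather than ffmpeg-python (an assumption for illustration; both produce the same 16 kHz mono PCM):

```python
import subprocess
import numpy as np

def pcm16_to_float32(pcm: bytes) -> np.ndarray:
    # Scale int16 samples into Whisper's expected [-1.0, 1.0] range
    return np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

def decode_audio(url: str) -> np.ndarray:
    """Fetch and decode audio at `url` into a 16 kHz mono float32 waveform."""
    out = subprocess.run(
        [
            "ffmpeg", "-i", url,
            "-f", "s16le", "-acodec", "pcm_s16le",
            "-ac", "1", "-ar", "16000",
            "pipe:1",
        ],
        check=True,
        capture_output=True,
    ).stdout
    return pcm16_to_float32(out)
```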