This configuration builds an inference engine to serve Whisper large-v3 on an A10G GPU. A similar configuration can be used for any Whisper model, including fine-tuned variants.

Whisper is an audio transcription model, not a chat model. However, its architecture is similar enough to an LLM's that it is supported by TensorRT-LLM.

Setup

See the end-to-end engine builder tutorial prerequisites for full setup instructions.

pip install --upgrade truss
truss init whisper-trt-llm
cd whisper-trt-llm
rm model/model.py # the engine builder serves the model from config.yaml alone; no custom Python code is needed

Configuration

Unlike the LLM examples, this configuration pulls the Whisper weights directly from OpenAI rather than from Hugging Face. The max_input_len and max_output_len parameters apply to the optional text prompt passed to the model, not to the audio file itself.

config.yaml
model_name: Whisper 3 Large Engine
resources:
  accelerator: A10G:1
  use_gpu: true
trt_llm:
  build:
    base_model: whisper
    checkpoint_repository:
      repo: https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
      source: REMOTE_URL
    max_batch_size: 8
    max_beam_width: 1
    max_input_len: 512
    max_output_len: 256
    quantization_type: no_quant
    tensor_parallel_count: 1
    num_builder_gpus: 1
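
As noted above, a similar configuration serves fine-tuned Whisper variants. A hedged sketch of the only section that would change; the URL below is hypothetical and stands in for wherever your fine-tuned .pt checkpoint (in the same format as OpenAI's) is hosted:

trt_llm:
  build:
    checkpoint_repository:
      repo: https://example.com/checkpoints/whisper-large-v3-finetuned.pt # hypothetical checkpoint URL
      source: REMOTE_URL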

Deployment

truss push --publish

Usage

call_model.py
import requests
import os

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Call model endpoint
resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
      "url": "https://www2.cs.uic.edu/~i101/SoundFiles/gettysburg10.wav",
    }
)

print(resp.content.decode("utf-8"))

Parameters

url (string, required)

A URL to a valid audio file (16 kHz, single-channel WAV). For testing, try the ten-second clip used in call_model.py above.

Audio files are limited to 30 seconds in length. For longer files, see building an audio transcription pipeline.

The request must include exactly one of url or audio.

audio (string, required)

A base64-encoded string of a valid audio file (16 kHz, single-channel WAV).

Audio files are limited to 30 seconds in length. For longer files, see building an audio transcription pipeline.

The request must include exactly one of audio or url.
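
To use the audio parameter instead of url, encode the file as base64 before sending it. A minimal sketch, assuming sample.wav is a hypothetical local 16 kHz, single-channel WAV file under 30 seconds:

import base64
import os

import requests

model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Read the local WAV file (assumed 16 kHz, mono) and base64-encode it
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={"audio": audio_b64},  # audio replaces url in the payload
)

print(resp.content.decode("utf-8"))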

prompt (string, optional)

The input text prompt to guide the language model’s generation.
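
The prompt field is sent in the same JSON payload as url or audio. A minimal sketch; the prompt text below is illustrative, and for Whisper-family models a prompt mainly nudges spelling and style rather than acting as an instruction:

import os

import requests

model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json={
        "url": "https://www2.cs.uic.edu/~i101/SoundFiles/gettysburg10.wav",
        # Illustrative prompt; proper nouns and spellings here can steer the output
        "prompt": "A recitation of the Gettysburg Address by Abraham Lincoln.",
    },
)

print(resp.content.decode("utf-8"))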