
View example on GitHub

This guide walks through building an audio transcription pipeline using Chains. You’ll break down large media files, distribute transcription tasks across autoscaling deployments, and leverage high-performance GPUs for rapid inference.

1. Overview

This Chain enables fast, high-quality transcription by:
  • Partitioning long files (10+ hours) into smaller segments.
  • Detecting silence to optimize split points.
  • Parallelizing inference across multiple GPU-backed deployments.
  • Batching requests to maximize throughput.
  • Using range downloads for efficient data streaming (sketched below).
  • Leveraging asyncio for concurrent execution.
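
As a sketch of the range-download idea, a client can check whether the server honors byte-range requests and then fetch only the byte windows it needs. The helpers below are illustrative, not part of the example code:

import requests

def supports_range_downloads(url: str) -> bool:
    # A HEAD request reveals whether the server accepts byte-range requests.
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return resp.headers.get("Accept-Ranges", "").lower() == "bytes"

def download_range(url: str, start: int, end: int) -> bytes:
    # Fetch only bytes [start, end] instead of the whole file.
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=30)
    resp.raise_for_status()
    return resp.content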

2. Chain Structure

Transcription is divided into two processing layers:
  1. Macro chunks: Large segments (~300s) split from the source media file. These are processed in parallel to handle massive files efficiently.
  2. Micro chunks: Smaller segments (~5–30s) extracted from macro chunks and sent to the Whisper model for transcription.
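
To make the split concrete, here is an illustrative helper that computes macro chunk boundaries from a file's duration; the real Chain additionally nudges each boundary toward a nearby silent point:

def macro_chunk_bounds(
    duration_sec: float, chunk_size_sec: float = 300.0
) -> list[tuple[float, float]]:
    # Partition [0, duration) into windows of roughly chunk_size_sec.
    bounds, start = [], 0.0
    while start < duration_sec:
        end = min(start + chunk_size_sec, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds

# A 10-hour file yields 120 macro chunks of ~300 s each:
assert len(macro_chunk_bounds(36_000)) == 120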

3. Implementing the Chainlets

Transcribe (Entrypoint Chainlet)

Handles transcription requests and dispatches tasks to worker Chainlets. Function signature:
async def run_remote(
    self,
    media_url: str,
    params: data_types.TranscribeParams,
) -> data_types.TranscribeOutput:
Steps:
  • Validates that the media source supports range downloads.
  • Uses FFmpeg to extract metadata and duration.
  • Splits the file into macro chunks, optimizing split points at silent sections.
  • Dispatches macro chunk tasks to the MacroChunkWorker for processing.
  • Collects micro chunk transcriptions, merges results, and returns the final text (see the sketch below).
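A condensed, illustrative shape of this entrypoint; the helpers (supports_range_downloads, get_duration_sec, macro_chunk_bounds) are hypothetical stand-ins, and MacroChunkWorker and data_types come from the example's own modules:

import asyncio
import truss_chains as chains

class Transcribe(chains.ChainletBase):
    # chains.depends wires the worker Chainlet into the entrypoint.
    def __init__(self, worker=chains.depends(MacroChunkWorker)) -> None:
        self._worker = worker

    async def run_remote(
        self, media_url: str, params: data_types.TranscribeParams
    ) -> data_types.TranscribeOutput:
        # Validate that the source supports range downloads (hypothetical helper).
        assert supports_range_downloads(media_url)
        # Probe the duration via FFmpeg, then compute silence-aware chunk bounds.
        duration = get_duration_sec(media_url)  # hypothetical helper
        bounds = macro_chunk_bounds(duration, params.macro_chunk_size_sec)
        # Fan out one task per macro chunk and await them all concurrently.
        results = await asyncio.gather(
            *(self._worker.run_remote(media_url, start, end, params)
              for start, end in bounds)
        )
        # Merge micro chunk transcriptions in order.
        return data_types.TranscribeOutput(text=" ".join(r.text for r in results))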
Example request:
curl -X POST $INVOCATION_URL \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d '{
      "media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4",
      "params": {
        "micro_chunk_size_sec": 30,
        "macro_chunk_size_sec": 300
      }
    }'
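
Equivalently, from Python, assuming the same environment variables:

import os
import requests

resp = requests.post(
    os.environ["INVOCATION_URL"],
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4",
        "params": {"micro_chunk_size_sec": 30, "macro_chunk_size_sec": 300},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())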

MacroChunkWorker (Processing Chainlet)

Processes macro chunks by:
  • Extracting relevant time segments using FFmpeg.
  • Streaming audio instead of downloading full files for low latency.
  • Splitting segments at silent points.
  • Encoding audio as base64 for transfer in JSON requests to the model.
  • Distributing micro chunks to the Whisper model for transcription.
Multiple replicas of this Chainlet run in parallel and are autoscaled dynamically; an illustrative core is sketched below.
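
The heart of the worker is the FFmpeg extraction step. The helper below is illustrative (the real example wraps this in a Chainlet and adds silence detection):

import base64
import subprocess

def extract_segment_b64(media_url: str, start_sec: float, duration_sec: float) -> str:
    # FFmpeg seeks into the remote URL and emits only the requested window as
    # 16 kHz mono WAV on stdout, so the full file is never downloaded.
    cmd = [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-ss", str(start_sec), "-i", media_url, "-t", str(duration_sec),
        "-ac", "1", "-ar", "16000", "-f", "wav", "pipe:1",
    ]
    wav_bytes = subprocess.run(cmd, check=True, capture_output=True).stdout
    # Base64 lets the audio travel inside a JSON request to WhisperModel.
    return base64.b64encode(wav_bytes).decode()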

WhisperModel (Inference Model)

A separately deployed Whisper model Chainlet handles speech-to-text transcription.
  • Deployed independently, so you can iterate on business logic without redeploying the model.
  • Reusable across different Chains, or invocable directly as a standalone model.
  • Usable from multiple environments (e.g., dev and prod) backed by the same instance.
Whisper can also be deployed as a standard Truss model, separate from the Chain.
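
A minimal skeleton of such a Chainlet, assuming faster-whisper as the inference backend and an illustrative GPU config (the actual example may differ on both counts):

import base64
import tempfile
import truss_chains as chains

class WhisperModel(chains.ChainletBase):
    # Assumption: compute resources; size these for your workload.
    remote_config = chains.RemoteConfig(compute=chains.Compute(gpu="A10G"))

    def __init__(self) -> None:
        from faster_whisper import WhisperModel as FasterWhisper
        self._model = FasterWhisper("large-v3", device="cuda")

    async def run_remote(self, audio_b64: str) -> str:
        # Workers send base64-encoded WAV audio; decode and transcribe it.
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            f.write(base64.b64decode(audio_b64))
            f.flush()
            segments, _ = self._model.transcribe(f.name)
            return " ".join(seg.text.strip() for seg in segments)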

4. Optimizing Performance

Even for very large files, processing time stays low: because macro chunks are transcribed in parallel across autoscaled replicas, wall-clock time scales with chunk size and replica count rather than with total file length.

Key performance tuning parameters:

  • micro_chunk_size_sec → Balance GPU utilization and inference latency.
  • macro_chunk_size_sec → Adjust chunk size for optimal parallelism.
  • Autoscaling settings → Tune concurrency and replica counts for load balancing.
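For example, with micro_chunk_size_sec=30 and macro_chunk_size_sec=300, each macro chunk yields roughly ten Whisper requests, which gives the model enough concurrent work to batch effectively.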
Example speedup:
{
  "input_duration_sec": 734.26,
  "processing_duration_sec": 82.42,
  "speedup": 8.9
}
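Here a roughly 12-minute input (734.26 s) was transcribed in 82.42 s: 734.26 / 82.42 ≈ 8.9× faster than real time.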

5. Deploy & Run the Chain

Deploy WhisperModel first:

truss chains push whisper_chainlet.py
Copy the invocation URL and update WHISPER_URL in transcribe.py.
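
Inside transcribe.py, the connection is typically made with a stub class. The sketch below follows the Chains stub pattern; the payload key is an assumption and must match what whisper_chainlet.py expects, so verify method names against your truss version:

import truss_chains as chains

WHISPER_URL = "https://..."  # paste the invocation URL you copied

class DeployedWhisper(chains.StubBase):
    # A stub lets this Chain call the separately deployed Whisper Chainlet.
    async def run_remote(self, audio_b64: str) -> dict:
        # Assumption: payload shape; align with your Whisper deployment.
        return await self._remote.predict_async(json_payload={"audio_b64": audio_b64})

# In the worker Chainlet, instantiate the stub from the deployment context:
# self._whisper = DeployedWhisper.from_url(WHISPER_URL, context)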

Deploy the transcription Chain:

truss chains push transcribe.py

Run transcription on a sample file, replacing <JSON_INPUT> with the JSON body shown in section 3:

curl -X POST $INVOCATION_URL \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d '<JSON_INPUT>'

Next Steps

  • Learn more about Chains.
  • Optimize GPU autoscaling for peak efficiency.
  • Extend the pipeline with custom business logic.