This guide walks through building an audio transcription pipeline using Chains. You’ll break down large media files, distribute transcription tasks across autoscaling deployments, and leverage high-performance GPUs for rapid inference.
1. Overview
This Chain enables fast, high-quality transcription by:
- Partitioning long files (10+ hours) into smaller segments.
- Detecting silence to optimize split points.
- Parallelizing inference across multiple GPU-backed deployments.
- Batching requests to maximize throughput.
- Using range downloads for efficient data streaming.
- Leveraging `asyncio` for concurrent execution (see the sketch below).
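As a minimal illustration of this fan-out pattern, the sketch below runs one coroutine per chunk with `asyncio.gather`. The names (`transcribe_chunk`, `transcribe_all`) are hypothetical stand-ins, not the Chain's actual API:

```python
import asyncio

# Minimal sketch of the fan-out pattern: one coroutine per macro chunk.
async def transcribe_chunk(start_sec: float, end_sec: float) -> str:
    await asyncio.sleep(0.01)  # placeholder for a remote Chainlet call
    return f"[{start_sec:.0f}s-{end_sec:.0f}s] ..."

async def transcribe_all(duration_sec: float, chunk_sec: float) -> str:
    tasks = [
        transcribe_chunk(start, min(start + chunk_sec, duration_sec))
        for start in range(0, int(duration_sec), int(chunk_sec))
    ]
    # All chunks run concurrently; gather preserves input order.
    parts = await asyncio.gather(*tasks)
    return " ".join(parts)

print(asyncio.run(transcribe_all(734.0, 300.0)))
```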
2. Chain Structure
Transcription is divided into two processing layers:
- Macro chunks: Large segments (~300s) split from the source media file. These are processed in parallel to handle massive files efficiently.
- Micro chunks: Smaller segments (~5–30s) extracted from macro chunks and sent to the Whisper model for transcription.
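To get a feel for the two layers, here is the rough chunk-count arithmetic for a 10-hour file at the sizes above (illustrative only):

```python
# Illustrative chunk-count arithmetic for a 10-hour source file.
source_sec = 10 * 3600           # 36,000 seconds of audio
macro_sec, micro_sec = 300, 30   # chunk sizes used in this guide

print(source_sec // macro_sec)   # 120 macro chunks, processed in parallel
print(macro_sec // micro_sec)    # up to 10 micro chunks per macro chunk
```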
3. Implementing the Chainlets
Transcribe (Entrypoint Chainlet)
Handles transcription requests and dispatches tasks to worker Chainlets.
Function signature:
```python
async def run_remote(
    self,
    media_url: str,
    params: data_types.TranscribeParams,
) -> data_types.TranscribeOutput:
```
Steps:
- Validates that the media source supports range downloads (see the sketch after this list).
- Uses FFmpeg to extract metadata and duration.
- Splits the file into macro chunks, optimizing split points at silent sections.
- Dispatches macro chunk tasks to the MacroChunkWorker for processing.
- Collects micro chunk transcriptions, merges results, and returns the final text.
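The range-download check in the first step can be as simple as a HEAD request: a server that supports range requests advertises `Accept-Ranges: bytes`. A minimal sketch (the Chain's actual helper may differ):

```python
import urllib.request

def supports_range_downloads(media_url: str) -> bool:
    """Return True if the server advertises byte-range support."""
    req = urllib.request.Request(media_url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Accept-Ranges", "").lower() == "bytes"
```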
Example request:
```bash
curl -X POST $INVOCATION_URL \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
        "media_url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4",
        "params": {
          "micro_chunk_size_sec": 30,
          "macro_chunk_size_sec": 300
        }
      }'
```
MacroChunkWorker (Processing Chainlet)
Processes macro chunks by:
- Extracting relevant time segments using FFmpeg.
- Streaming audio instead of downloading full files for low latency.
- Splitting segments at silent points.
- Encoding audio in base64 for efficient transfer.
- Distributing micro chunks to the Whisper model for transcription.
Multiple instances of this Chainlet run in parallel and are autoscaled dynamically with load.
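For example, extracting one micro chunk and encoding it for transfer could look like the sketch below. The FFmpeg flags are standard, but the function itself (`extract_chunk_b64`) is a hypothetical stand-in for the Chainlet's internals:

```python
import base64
import subprocess

def extract_chunk_b64(media_url: str, start_sec: float, dur_sec: float) -> str:
    cmd = [
        "ffmpeg",
        "-ss", str(start_sec),   # seek to the chunk's start time
        "-t", str(dur_sec),      # read only this chunk's duration
        "-i", media_url,         # FFmpeg streams the source over HTTP
        "-vn",                   # drop video, keep audio only
        "-f", "wav", "pipe:1",   # write WAV to stdout instead of a file
    ]
    audio = subprocess.run(cmd, capture_output=True, check=True).stdout
    return base64.b64encode(audio).decode("utf-8")
```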
WhisperModel (Inference Model)
A separately deployed Whisper model Chainlet handles speech-to-text transcription.
- Deployed independently, so you can iterate on business logic without redeploying the model.
- Reusable across different Chains, or callable directly as a standalone model.
- Supports multiple environments (e.g., dev and prod) sharing the same instance.
Whisper can also be deployed as a standard Truss model, separate from the Chain.
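Calling the deployed model from the Chain can be as simple as an authenticated HTTP POST. In this sketch, `WHISPER_URL` and the request/response fields (`audio_b64`, `text`) are assumptions for illustration, not the model's confirmed schema:

```python
import httpx

WHISPER_URL = "https://<your-whisper-deployment>/predict"  # placeholder URL

async def transcribe_micro_chunk(audio_b64: str, api_key: str) -> str:
    async with httpx.AsyncClient(timeout=600.0) as client:
        resp = await client.post(
            WHISPER_URL,
            headers={"Authorization": f"Api-Key {api_key}"},
            json={"audio_b64": audio_b64},  # assumed request schema
        )
    resp.raise_for_status()
    return resp.json()["text"]  # assumed response field
```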
4. Optimizing Performance
Because macro chunks are processed in parallel, total processing time remains bounded even for very large files. The main tuning knobs are:
- `micro_chunk_size_sec` → balances GPU utilization against inference latency.
- `macro_chunk_size_sec` → controls the degree of parallelism across workers.
- Autoscaling settings → tune concurrency and replica counts for load balancing.
Example speedup:
```json
{
  "input_duration_sec": 734.26,
  "processing_duration_sec": 82.42,
  "speedup": 8.9
}
```
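The reported speedup is simply input duration divided by processing time:

```python
# Sanity-check the reported speedup.
print(round(734.26 / 82.42, 1))  # 8.9
```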
5. Deploy & Run the Chain
Deploy WhisperModel first:
```bash
truss chains push whisper_chainlet.py
```
Copy the invocation URL and update `WHISPER_URL` in `transcribe.py`.
Deploy the transcription Chain:
```bash
truss chains push transcribe.py
```
Run transcription on a sample file:
```bash
curl -X POST $INVOCATION_URL \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '<JSON_INPUT>'
```
Use the JSON payload from the example request in section 3 as `<JSON_INPUT>`.
Next Steps
- Learn more about Chains.
- Optimize GPU autoscaling for peak efficiency.
- Extend the pipeline with custom business logic.