View example on GitHub
1. Overview
This Chain enables fast, high-quality transcription by:

- Partitioning long files (10+ hours) into smaller segments.
- Detecting silence to optimize split points.
- Parallelizing inference across multiple GPU-backed deployments.
- Batching requests to maximize throughput.
- Using range downloads for efficient data streaming.
- Leveraging `asyncio` for concurrent execution.
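To make the range-download idea above concrete, here is a minimal, illustrative sketch (not the Chain's actual code) of probing a source for byte-range support and fetching only a slice of it:

```python
import requests

def supports_range_downloads(url: str) -> bool:
    """Check whether the media source accepts HTTP range requests."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return resp.headers.get("Accept-Ranges", "").lower() == "bytes"

def download_slice(url: str, start: int, end: int) -> bytes:
    """Fetch only the bytes in [start, end] instead of the whole file."""
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=30)
    resp.raise_for_status()
    return resp.content
```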
2. Chain Structure
Transcription is divided into two processing layers:

- Macro chunks: Large segments (~300s) split from the source media file. These are processed in parallel to handle massive files efficiently.
- Micro chunks: Smaller segments (~5–30s) extracted from macro chunks and sent to the Whisper model for transcription.
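Putting rough numbers on this two-layer structure (a sketch with illustrative names and evenly spaced boundaries; the real Chain shifts split points to nearby silence):

```python
import math

MACRO_CHUNK_SIZE_SEC = 300  # ~300 s macro chunks, processed in parallel
MICRO_CHUNK_SIZE_SEC = 30   # 5-30 s micro chunks sent to Whisper

def macro_chunk_bounds(duration_sec: float) -> list[tuple[float, float]]:
    """Evenly spaced macro chunk boundaries as (start_sec, end_sec) pairs."""
    n = math.ceil(duration_sec / MACRO_CHUNK_SIZE_SEC)
    return [
        (i * MACRO_CHUNK_SIZE_SEC, min((i + 1) * MACRO_CHUNK_SIZE_SEC, duration_sec))
        for i in range(n)
    ]

# A 10-hour file yields 120 macro chunks, each transcribed concurrently.
print(len(macro_chunk_bounds(10 * 3600)))  # -> 120
```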
3. Implementing the Chainlets
Transcribe (Entrypoint Chainlet)
Handles transcription requests and dispatches tasks to worker Chainlets.
Function signature:
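A minimal sketch of what this might look like, assuming the `truss_chains` API; the parameter names, return types, and the stubbed `MacroChunkWorker` signature are illustrative, not the example's exact code:

```python
import truss_chains as chains

class MacroChunkWorker(chains.ChainletBase):
    """Defined further down in transcribe.py; stubbed here so the sketch is complete."""
    async def run_remote(self, media_url: str, start_sec: float, end_sec: float) -> str:
        ...

@chains.mark_entrypoint
class Transcribe(chains.ChainletBase):
    def __init__(self, worker: MacroChunkWorker = chains.depends(MacroChunkWorker)):
        self._worker = worker

    async def run_remote(self, media_url: str) -> str:
        """Validate the source, plan macro chunks, fan out work, merge the text."""
        ...
```

At a high level, the entrypoint: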
- Validates that the media source supports range downloads.
- Uses FFmpeg to extract metadata and duration.
- Splits the file into macro chunks, optimizing split points at silent sections.
- Dispatches macro chunk tasks to the MacroChunkWorker for processing.
- Collects micro chunk transcriptions, merges results, and returns the final text.
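The dispatch-and-merge step might look roughly like the following (a sketch reusing the illustrative `macro_chunk_bounds` helper from section 2; the worker's `run_remote` signature is an assumption):

```python
import asyncio

async def transcribe_all(worker, media_url: str, duration_sec: float) -> str:
    # One task per macro chunk; each runs on the MacroChunkWorker deployment in parallel.
    tasks = [
        asyncio.ensure_future(worker.run_remote(media_url, start, end))
        for start, end in macro_chunk_bounds(duration_sec)
    ]
    # Results arrive in chunk order, so merging is a simple join.
    chunk_texts = await asyncio.gather(*tasks)
    return " ".join(chunk_texts)
```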
MacroChunkWorker (Processing Chainlet)
Processes macro chunks by:
- Extracting relevant time segments using FFmpeg.
- Streaming audio instead of downloading full files for low latency.
- Splitting segments at silent points.
- Encoding audio in base64 for efficient transfer.
- Distributing micro chunks to the Whisper model for transcription.
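A simplified sketch of the extract-and-encode steps above (silence detection and true streaming are omitted; the FFmpeg flags are standard, everything else is illustrative):

```python
import base64
import subprocess

def extract_micro_chunk_b64(source_url: str, start_sec: float, duration_sec: float) -> str:
    """Cut one micro chunk as 16 kHz mono WAV and base64-encode it for transfer."""
    cmd = [
        "ffmpeg",
        "-ss", str(start_sec),       # seek to the chunk start
        "-t", str(duration_sec),     # keep only the chunk duration
        "-i", source_url,            # FFmpeg can read HTTP(S) sources directly
        "-vn",                       # drop any video stream
        "-ac", "1", "-ar", "16000",  # mono, 16 kHz audio
        "-f", "wav", "pipe:1",       # write WAV to stdout instead of a file
    ]
    audio = subprocess.run(cmd, check=True, capture_output=True).stdout
    return base64.b64encode(audio).decode("utf-8")
```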
WhisperModel (Inference Model)
A separately deployed Whisper model Chainlet handles speech-to-text transcription.
- Deployed independently to allow fast iteration on business logic without redeploying the model.
- Used across different Chains or accessed directly as a standalone model.
- Supports multiple environments (e.g., dev, prod) using the same instance.
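Because the model is its own deployment, the Chain only needs its URL. A hedged sketch of a direct call (the payload shape, response field, auth header, and environment variable are assumptions; check the deployed model's API):

```python
import os
import aiohttp

async def call_whisper(whisper_url: str, audio_b64: str) -> str:
    headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.post(whisper_url, json={"audio_b64": audio_b64}) as resp:
            resp.raise_for_status()
            result = await resp.json()
    return result["text"]  # response field name is an assumption
```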
4. Optimizing Performance
Because macro chunks are processed in parallel, processing time remains bounded even for very large files.

Key performance tuning parameters:

- `micro_chunk_size_sec` → Balance GPU utilization and inference latency.
- `macro_chunk_size_sec` → Adjust chunk size for optimal parallelism.
- Autoscaling settings → Tune concurrency and replica counts for load balancing.
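To see how these knobs interact when sizing autoscaling, a rough back-of-the-envelope helper (parameter names follow this guide; defaults and the calculation are illustrative):

```python
import math
from dataclasses import dataclass

@dataclass
class TranscribeParams:
    micro_chunk_size_sec: int = 30    # larger -> better GPU utilization, higher latency
    macro_chunk_size_sec: int = 300   # smaller -> more macro chunks running in parallel

def peak_whisper_requests(duration_sec: float, p: TranscribeParams) -> int:
    """Rough upper bound on concurrent Whisper requests; replica count times
    per-replica concurrency should be sized to cover this."""
    macro_chunks = math.ceil(duration_sec / p.macro_chunk_size_sec)
    micro_per_macro = math.ceil(p.macro_chunk_size_sec / p.micro_chunk_size_sec)
    return macro_chunks * micro_per_macro

print(peak_whisper_requests(10 * 3600, TranscribeParams()))  # -> 1200
```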
5. Deploy & Run the Chain
Deploy the WhisperModel first and copy its invocation URL into `WHISPER_URL` in `transcribe.py`:
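For example, assuming the Whisper model is packaged as its own Chainlet file (the filename below is illustrative) and deployed with the Truss CLI:

```sh
# Deploy the standalone Whisper model; note the invocation URL it prints.
truss chains push whisper_chainlet.py
```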
Deploy the transcription Chain:
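With `WHISPER_URL` set, the Chain itself can be pushed. This assumes the Truss CLI; check the example's README for the exact invocation:

```sh
truss chains push transcribe.py
```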
Run transcription on a sample file:
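A hedged sketch of invoking the deployed Chain over HTTP; the endpoint path, payload key, and environment variable are assumptions, so copy the real invocation URL from the deployment output:

```python
import os
import requests

CHAIN_URL = "https://chain-<chain-id>.api.baseten.co/production/run_remote"  # placeholder

resp = requests.post(
    CHAIN_URL,
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"media_url": "https://example.com/long-podcast.mp3"},  # sample media file
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```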
Next Steps
- Learn more about Chains.
- Optimize GPU autoscaling for peak efficiency.
- Extend the pipeline with custom business logic.