Streaming Output for LLMs
Streaming output significantly reduces perceived wait time for generative AI models by returning tokens as they are generated instead of waiting for the full response to complete.
Why Streaming?
- ✅ Faster response time – Get initial results in under 1 second instead of waiting 10+ seconds.
- ✅ Improved user experience – Partial outputs are immediately usable.
This guide walks through deploying Falcon 7B with streaming enabled.
1. Initialize Truss
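Assuming Truss is installed (`pip install --upgrade truss`), create a new Truss to work in; the directory name here is illustrative:

```sh
truss init falcon-7b-streaming
cd falcon-7b-streaming
```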
2. Implement Model (Non-Streaming)
This first version loads the Falcon 7B model without streaming:
model/model.py
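A minimal non-streaming sketch, assuming the standard Truss `Model` class interface (`load`/`predict`) and the `tiiuae/falcon-7b-instruct` checkpoint; the checkpoint and generation parameters are illustrative choices, not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "tiiuae/falcon-7b-instruct"  # illustrative choice of Falcon 7B variant


class Model:
    def __init__(self, **kwargs):
        self._tokenizer = None
        self._model = None

    def load(self):
        # Runs once at startup: load tokenizer and weights onto the GPU.
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    def predict(self, model_input):
        # Blocks until the entire response is generated, then returns it.
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
        output_ids = self._model.generate(**inputs, max_new_tokens=256)
        return {
            "output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)
        }
```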
3. Add Streaming Support
To enable streaming, we:
- Use `TextIteratorStreamer` to stream tokens as they are generated.
- Run `generate()` in a separate thread to prevent blocking.
- Return a generator that streams results.
model/model.py
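A sketch of the streaming version, under the same assumptions as above (`load()` is unchanged); `TextIteratorStreamer` is from `transformers`, and returning a generator from `predict` is how Truss streams chunks to the client:

```python
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

CHECKPOINT = "tiiuae/falcon-7b-instruct"  # illustrative choice of Falcon 7B variant


class Model:
    def __init__(self, **kwargs):
        self._tokenizer = None
        self._model = None

    def load(self):
        self._tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self._model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    def predict(self, model_input):
        prompt = model_input["prompt"]
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)

        # The streamer yields decoded text chunks as generate() produces tokens.
        streamer = TextIteratorStreamer(
            self._tokenizer, skip_prompt=True, skip_special_tokens=True
        )

        # Run generation in a background thread so predict() can return
        # the generator immediately instead of blocking until completion.
        generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)
        thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
        thread.start()

        def token_stream():
            # Each iteration yields the next chunk of generated text.
            for text in streamer:
                yield text
            thread.join()

        return token_stream()
```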
4. Configure config.yaml
config.yaml
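A sketch of the Truss config; the model name, Python version, and accelerator are illustrative values to adapt to your setup (`accelerate` is needed for `device_map="auto"`):

```yaml
model_name: falcon-7b-streaming
python_version: py311
requirements:
  - torch
  - transformers
  - accelerate
resources:
  accelerator: A10G
  use_gpu: true
```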
5. Deploy & Invoke
Deploy the model:
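Assuming you are deploying to Baseten with the Truss CLI and have an API key configured:

```sh
truss push
```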
Invoke with:
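One way to test the deployment from the Truss CLI (the prompt is illustrative); because `predict` returns a generator, output arrives as a stream of chunks rather than a single response:

```sh
truss predict -d '{"prompt": "What is the meaning of life?"}'
```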