Why streaming?
- Faster response time: Get initial results in under 1 second instead of waiting 10 or more seconds.
- Improved user experience: Partial outputs are immediately usable.
1. Initialize Truss
2. Implement the model without streaming
This first version loads the Falcon 7B model without streaming:
model/model.py
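A minimal sketch of a non-streaming model, assuming Truss's usual `Model` class layout with `load` and `predict`. The checkpoint name `tiiuae/falcon-7b-instruct`, the `prompt` and `max_new_tokens` request keys, the 150-token default, and the `output` response key are illustrative assumptions, not values taken from this guide.

```python
from typing import Any, Dict

# Assumed checkpoint and default; swap in whichever Falcon 7B variant you deploy.
CHECKPOINT = "tiiuae/falcon-7b-instruct"
DEFAULT_MAX_NEW_TOKENS = 150


class Model:
    def __init__(self, **kwargs) -> None:
        # Nothing heavy happens here; weights are loaded in load().
        self.tokenizer = None
        self.model = None

    def load(self) -> None:
        # Heavy imports are deferred so the module itself imports quickly.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        # device_map="auto" relies on the accelerate package being installed.
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    def predict(self, request: Dict[str, Any]) -> Dict[str, str]:
        prompt = request["prompt"]
        max_new_tokens = request.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS)

        # Tokenize and move the input IDs to wherever the model lives.
        encoded = self.tokenizer(prompt, return_tensors="pt")
        input_ids = encoded["input_ids"].to(self.model.device)

        # Blocks until the whole completion has been generated.
        output_ids = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
        )

        # For decoder-only models the decoded text includes the prompt.
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return {"output": text}
```

Because `generate()` returns only once every token has been produced, the caller sees nothing until the full completion is ready.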
3. Add streaming support
To enable streaming:
- Use TextIteratorStreamer to stream tokens as they’re generated.
- Run generate() in a separate thread to prevent blocking.
- Return a generator that streams results.
model/model.py
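The three changes can be sketched as follows. This is one possible layout rather than a definitive implementation: the checkpoint name, the request keys, and the `_streamer_cls` attribute (used here only to keep the heavy import inside `load()`) are assumptions.

```python
from threading import Thread
from typing import Any, Dict, Iterator

# Assumed checkpoint and default; not values taken from this guide.
CHECKPOINT = "tiiuae/falcon-7b-instruct"
DEFAULT_MAX_NEW_TOKENS = 150


class Model:
    def __init__(self, **kwargs) -> None:
        self.tokenizer = None
        self.model = None
        self._streamer_cls = None  # set in load(); hypothetical helper attribute

    def load(self) -> None:
        import torch
        from transformers import (
            AutoModelForCausalLM,
            AutoTokenizer,
            TextIteratorStreamer,
        )

        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self._streamer_cls = TextIteratorStreamer

    def predict(self, request: Dict[str, Any]) -> Iterator[str]:
        prompt = request["prompt"]
        max_new_tokens = request.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS)

        encoded = self.tokenizer(prompt, return_tensors="pt")
        input_ids = encoded["input_ids"].to(self.model.device)

        # The streamer is an iterator that yields decoded text as
        # generate() pushes new tokens into it.
        streamer = self._streamer_cls(
            self.tokenizer,
            skip_prompt=True,          # do not echo the prompt back
            skip_special_tokens=True,  # forwarded to tokenizer.decode
        )

        # generate() blocks until completion, so it runs in its own thread
        # while this thread reads from the streamer.
        thread = Thread(
            target=self.model.generate,
            kwargs={
                "input_ids": input_ids,
                "streamer": streamer,
                "max_new_tokens": max_new_tokens,
            },
        )
        thread.start()

        def stream_tokens() -> Iterator[str]:
            for text in streamer:
                yield text
            thread.join()  # generation is finished once the streamer ends

        # Returning a generator lets the server stream chunks to the client.
        return stream_tokens()
```

If `generate()` raises inside the worker thread, the streamer never receives its end signal. Passing a `timeout` to `TextIteratorStreamer` is one way to avoid a reader that waits forever.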
4. Configure config.yaml
config.yaml
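A hedged sketch of what this file might contain. The model name is hypothetical, the requirements are left unpinned, and the A10G accelerator is an assumption about what a 7B-parameter model needs; adjust all of these to your deployment.

```yaml
# Hypothetical name; pick your own.
model_name: falcon-7b-streaming
# Unpinned for illustration; pin versions for reproducible builds.
requirements:
  - torch
  - transformers
  - accelerate   # lets transformers place the weights on the GPU
resources:
  accelerator: A10G   # assumed GPU type
  use_gpu: true
```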