Why Streaming?
For certain ML models, generations can take a long time. Especially with LLMs, a long output could take 10-20 seconds to generate. However, because LLMs generate tokens in sequence, useful output can be made available to users sooner. To support this, Truss supports streaming output.

Set up the imports
In this example, we use the Hugging Face transformers library to build a text generation model.

model/model.py
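A minimal sketch of the imports this file needs for the streaming pattern described below; the original example's exact import list may differ slightly:

```python
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
```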
Define the load function
In the load function of the Truss, we implement the logic for downloading the chat version of the Qwen 7B model and loading it into memory.
model/model.py
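A sketch of what the load function could look like inside the Model class that Truss expects; the exact checkpoint name and loading options here are assumptions:

```python
class Model:
    def __init__(self, **kwargs):
        self.tokenizer = None
        self.model = None

    def load(self):
        # The checkpoint name is an assumption; Qwen chat checkpoints on the
        # Hugging Face Hub need trust_remote_code=True.
        model_id = "Qwen/Qwen-7B-Chat"
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
        )
```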
Define the preprocess function
In the preprocess function of the Truss, we set up a generate_args dictionary with generation arguments from the inference request, to be used later in the predict function.
model/model.py
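A sketch of the preprocess step, continuing the Model class above; the specific argument names and default values are assumptions:

```python
    def preprocess(self, model_input):
        # Pull generation arguments out of the request, with fallback defaults.
        generate_args = {
            "max_new_tokens": model_input.get("max_new_tokens", 512),
            "temperature": model_input.get("temperature", 0.8),
            "top_p": model_input.get("top_p", 0.95),
            "do_sample": True,
        }
        return {
            "prompt": model_input["prompt"],
            "stream": model_input.get("stream", False),
            "generate_args": generate_args,
        }
```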
Define the predict function
In the predict function of the Truss, we implement the actual inference logic.
The two main steps are:
- Tokenize the input
- Call the model’s generate function if we’re not streaming the output; otherwise, call the stream helper function
model/model.py
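A sketch of the predict step under the same assumptions: it tokenizes the prompt, then either generates to completion or hands off to the stream helper defined next.

```python
    def predict(self, model_input):
        prompt = model_input["prompt"]
        stream = model_input["stream"]
        generate_args = model_input["generate_args"]

        # Tokenize the input prompt and move it to the model's device.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        if stream:
            # Streaming path: return a generator from the stream helper.
            return self.stream(inputs, generate_args)

        # Non-streaming path: generate to completion, then decode once.
        with torch.no_grad():
            output = self.model.generate(**inputs, **generate_args)
        return {"output": self.tokenizer.decode(output[0], skip_special_tokens=True)}
```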
Define the stream helper function
In this helper function, we’ll instantiate the TextIteratorStreamer object, which we’ll later use for
returning the LLM output to users.
model/model.py
We pass the streamer object that we created previously into the generation arguments.
model/model.py
We then run the generation in a separate thread so that it does not block while we stream tokens back to the user.

model/model.py
To stream output back to the user, we return a generator that yields content from the streamer, which produces output until the generation is complete. We define this inner function to create our generator.
model/model.py
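Putting the steps above together, a sketch of the stream helper, continuing the Model class; the exact streamer options and helper signature are assumptions:

```python
    def stream(self, inputs, generate_args):
        # Instantiate the streamer that will receive decoded text as it is produced.
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )

        # Pass the streamer in alongside the other generation arguments.
        generation_kwargs = {**inputs, "streamer": streamer, **generate_args}

        # Run generation in a separate thread so it does not block while
        # we yield tokens back to the caller.
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        # Inner generator: yield text from the streamer until generation completes.
        def generator():
            for text in streamer:
                yield text
            thread.join()

        return generator()
```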
Setting up the config.yaml
Running Qwen 7B requires torch, transformers,
and a few other related libraries.
config.yaml
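A sketch of the requirements section; the exact package list and version pins in the original example may differ:

```yaml
requirements:
  # einops and tiktoken are commonly needed by Qwen checkpoints that use
  # trust_remote_code.
  - torch
  - transformers
  - accelerate
  - einops
  - tiktoken
```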
Configure resources for Qwen
We will use an L4 GPU to run this model.

config.yaml
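A sketch of the resources section, assuming the standard Truss resources keys:

```yaml
resources:
  accelerator: L4
  use_gpu: true
```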