Building an LLM with streaming output
In the `load` function of the Truss, we implement the logic for downloading the chat version of the Qwen 7B model and loading it into memory.
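A sketch of what `load` might look like, assuming the model is fetched from Hugging Face (the model ID and loading options below are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class Model:
    def __init__(self, **kwargs):
        self._model = None
        self._tokenizer = None

    def load(self):
        # Assumed Hugging Face ID for the chat version of Qwen 7B.
        model_id = "Qwen/Qwen-7B-Chat"
        self._tokenizer = AutoTokenizer.from_pretrained(
            model_id, trust_remote_code=True
        )
        self._model = AutoModelForCausalLM.from_pretrained(
            model_id, device_map="auto", trust_remote_code=True
        )
```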
In the `preprocess` function of the Truss, we set up a `generate_args` dictionary with generation arguments taken from the inference request, to be used later in the `predict` function.
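For example, `preprocess` could assemble the dictionary along these lines (the argument names and defaults are illustrative, not required settings):

```python
def preprocess(self, request: dict) -> dict:
    # Collect generation arguments from the request into one place;
    # the keys and defaults here are illustrative.
    request["generate_args"] = {
        "max_new_tokens": request.get("max_new_tokens", 512),
        "temperature": request.get("temperature", 0.9),
        "top_p": request.get("top_p", 0.95),
        "do_sample": True,
    }
    return request
```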
In the `predict` function of the Truss, we implement the actual inference logic. The two main steps, sketched after the list, are:

- Call the model's `generate` function if we're not streaming the output; otherwise, call the `stream` helper function.
- In the `stream` helper function, create a `TextIteratorStreamer` object, which we'll later use for returning the LLM output to users.
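A minimal sketch of `predict`, assuming the prompt and a streaming flag arrive as request fields (the field names `prompt` and `stream` are assumptions):

```python
def predict(self, request: dict):
    # "prompt" and "stream" are assumed request field names.
    prompt = request["prompt"]
    generate_args = request["generate_args"]

    # Tokenize the prompt and move tensors to the model's device.
    inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)

    if request.get("stream", False):
        # Streaming path: defer to the stream helper, sketched below.
        return self.stream(inputs, generate_args)

    # Non-streaming path: generate to completion and decode once.
    output_ids = self._model.generate(**inputs, **generate_args)
    return {
        "output": self._tokenizer.decode(output_ids[0], skip_special_tokens=True)
    }
```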
Inside `stream`, we spawn a thread to run the generation, passing in the `streamer` object that we created previously. An inner function then iterates over the `streamer`, which produces output and yields it until the generation is complete. We define this inner function to create our generator.
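Putting those steps together, the `stream` helper might look like this sketch, built on Hugging Face's `TextIteratorStreamer` and Python's `threading` module:

```python
import threading

from transformers import TextIteratorStreamer

def stream(self, inputs, generate_args: dict):
    # The streamer collects decoded text as generate() produces tokens;
    # skip_prompt drops the echoed input from the output.
    streamer = TextIteratorStreamer(
        self._tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # Run generation on a background thread so this one can consume
    # the streamer as output arrives.
    generation_kwargs = {**inputs, **generate_args, "streamer": streamer}
    thread = threading.Thread(target=self._model.generate, kwargs=generation_kwargs)
    thread.start()

    # Inner generator: yield each text chunk until generation finishes.
    def generator():
        for text in streamer:
            yield text
        thread.join()

    return generator()
```

Returning this generator from `predict` is what lets the server send each chunk of text back to the client as it is produced.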
config.yaml
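The Truss also needs a `config.yaml` describing the serving environment. A minimal sketch, assuming a GPU deployment (the accelerator type and package list are assumptions):

```yaml
model_name: Qwen 7B Chat
requirements:
  - torch
  - transformers
resources:
  accelerator: A10G
  use_gpu: true
```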