Streaming output with an LLM
Deploy an LLM and stream the output
The worst part of using generative AI tools is the long wait time during model inference. For some types of generative models, including large language models (LLMs), you can start getting results 10X faster by streaming model output as it is generated.
LLMs have two properties that make streaming output particularly useful:
- Generating a complete response takes time, easily 10 seconds or more for longer outputs
- Partial outputs are often useful!
When you host your LLMs with Baseten, you can stream responses. Instead of having to wait for the entire output to be generated, you can immediately start returning results to users with a sub-one-second time-to-first-token.
In this example, we will show you how to deploy Falcon 7B, an LLM, and stream the output as it is generated.
The snippets below show the code for the finished Falcon 7B Truss. Keep reading for step-by-step instructions on how to build it.
Step 0: Initialize Truss
Get started by creating a new Truss:
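If you have the Truss package installed (it is available on PyPI), something like the following should work; the directory name `falcon-streaming` is just an example:

```sh
# Install or upgrade the Truss CLI
pip install --upgrade truss

# Create a new Truss in a directory called falcon-streaming
truss init falcon-streaming
```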
Give your model a name when prompted, like `falcon-streaming`. Then, navigate to the newly created directory:
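Assuming you kept `falcon-streaming` as the directory name:

```sh
cd falcon-streaming
```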
Step 1: Set up the Model class without streaming
As mentioned before, Falcon 7B is an LLM. We will use the Hugging Face transformers library to load and run the model. In this first step, we will generate output normally and return it without streaming.
In `model/model.py`, we write the class `Model` with three member functions:
- `__init__`, which creates an instance of the object with a `_model` property
- `load`, which runs once when the model server is spun up and loads the `pipeline` model
- `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output for non-streaming outputs.
Read the quickstart guide for more details on `Model` class implementation.
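Here is a minimal sketch of what `model/model.py` could look like at this stage. It assumes the `tiiuae/falcon-7b-instruct` checkpoint from Hugging Face; the generation parameters and input key (`prompt`) are illustrative choices, not fixed requirements.

```python
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        # The pipeline is created in load(), not here
        self._model = None

    def load(self):
        # Runs once when the model server spins up
        self._model = pipeline(
            "text-generation",
            model="tiiuae/falcon-7b-instruct",
            device_map="auto",
            trust_remote_code=True,
        )

    def predict(self, model_input):
        # Runs on every request; takes and returns JSON-serializable values
        prompt = model_input["prompt"]
        result = self._model(prompt, max_new_tokens=256)
        return {"output": result[0]["generated_text"]}
```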
Step 2: Add streaming support
Once we have a model that can produce LLM outputs using the Hugging Face transformers library, we can adapt it to support streaming. The key change is in the `predict` function.
While the `predict` function above returns a `Dict` containing the model output, to stream results we instead need to return a Python generator from `predict`. This lets us return partial results to the user as they are generated.
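The shape of that change is small: `predict` yields chunks instead of returning one finished value. As a toy sketch (no model involved yet, just the pattern):

```python
class Model:
    # __init__ and load stay the same as in the non-streaming version

    def predict(self, model_input):
        def stream():
            # Each yielded chunk is sent to the client as soon as it is ready
            for word in ("Falcons ", "are ", "fast. "):
                yield word

        return stream()
```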
To produce outputs incrementally for the LLM, we pass a `TextIteratorStreamer` object to the `generate` function. This object returns the model output as it is generated. We then kick off the generation on a separate thread.
What we return from the `predict` function is a generator that yields the model output from the streamer object as it is generated.
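Putting that together with the real model, a streaming `predict` might look like the sketch below. It reuses the pipeline from `load` and pulls out its tokenizer and underlying model; the `max_new_tokens` value and streamer options are assumptions you can adjust.

```python
from threading import Thread

from transformers import TextIteratorStreamer, pipeline


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Same as before: load Falcon 7B into a text-generation pipeline
        self._model = pipeline(
            "text-generation",
            model="tiiuae/falcon-7b-instruct",
            device_map="auto",
            trust_remote_code=True,
        )

    def predict(self, model_input):
        prompt = model_input["prompt"]
        tokenizer = self._model.tokenizer
        model = self._model.model

        # The streamer yields decoded text as soon as new tokens are available
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=True, skip_special_tokens=True
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)

        # Run generation in a background thread so we can start reading from
        # the streamer (and returning tokens to the caller) right away
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        def stream_output():
            for text in streamer:
                yield text
            thread.join()

        return stream_output()
```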
Step 3: Add remainder of Truss configuration
Once we have the model code written, the next thing we need to do before deploying is make sure the rest of the Truss configuration is in place.
The only things we need to add to `config.yaml` are the Python requirements and hardware resources for the model.
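A `config.yaml` along these lines should work; the pinned package versions and the accelerator choice are assumptions you may want to adjust for your account and hardware availability:

```yaml
model_name: Falcon 7B Streaming
python_version: py310
requirements:
  - accelerate==0.21.0
  - einops==0.6.1
  - torch==2.0.1
  - transformers==4.31.0
resources:
  accelerator: A10G
  use_gpu: true
```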
Step 4: Deploy the model
You’ll need a Baseten API key for this step.
We have successfully packaged Falcon as a Truss. Let’s deploy! Run:
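From inside the `falcon-streaming` directory (paste your Baseten API key if prompted):

```sh
truss push
```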
Step 5: Invoke the model
You can invoke the model with:
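One way to call the deployed model is with the Truss CLI; the `-d` payload here is just an example prompt, and the returned text streams back as it is generated:

```sh
truss predict -d '{"prompt": "Tell me about falcons."}'
```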