How to stream model output
Reduce time to first token for LLMs
For instructions on packaging and deploying a model with streaming output, see this Truss example. This guide covers how to call a model that has a streaming-capable endpoint.
Any model can be packaged with support for streaming output, but it only makes sense to do so for models where:
- Generating a complete output takes a relatively long time.
- The first tokens of output are useful without the context of the rest of the output.
- Reducing the time to first token improves the user experience.
LLMs in chat applications are the perfect use case for streaming model output.
Example: Streaming with Mistral
Mistral 7B Instruct from Baseten’s model library is an LLM with streaming support. Invocation works the same way for any other model library LLM, as well as for any Truss that follows the same standard.
Deploy Mistral 7B Instruct or a similar LLM to run the following examples.
Truss CLI
The Truss CLI has built-in support for streaming model output.
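For example, with the Truss for your deployed model in the working directory, a call like the one below prints tokens to the terminal as they are generated. This is a minimal sketch: the prompt payload is an assumption, and the exact input schema depends on how the model was packaged.

```sh
# Call the deployed model from the directory containing its Truss.
# The payload schema depends on the model; "prompt" here is an assumption.
truss predict -d '{"prompt": "What is the Mistral wind?"}'
```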
API endpoint
When using a streaming endpoint with cURL, use the --no-buffer flag to stream output as it is received.
As with all cURL invocations, you’ll need a model ID and API key.
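Here is a minimal sketch of a streaming cURL call, assuming Baseten's standard model invocation endpoint. Replace `<model_id>` and the API key with your own values; the prompt and stream fields are assumptions about the model's input schema.

```sh
# --no-buffer (-N) makes cURL print output as it arrives instead of buffering it.
curl --no-buffer -X POST "https://model-<model_id>.api.baseten.co/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"prompt": "What is the Mistral wind?", "stream": true}'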
Python application
Let’s take things a step further and look at how to integrate streaming output with a Python application.
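Below is a minimal sketch using the requests library, assuming the same invocation endpoint and payload as the cURL example above. The prompt and stream fields are assumptions about the model's input schema; swap in whatever your model expects.

```python
import os

import requests

MODEL_ID = "<model_id>"  # Replace with your deployed model's ID.
API_KEY = os.environ["BASETEN_API_KEY"]  # Assumes your API key is set in the environment.

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    # Assumed input schema; adjust to match your model.
    json={"prompt": "What is the Mistral wind?", "stream": True},
    stream=True,  # Tell requests not to buffer the whole response body.
)
resp.raise_for_status()

# Print tokens as they arrive instead of waiting for the complete output.
resp.encoding = resp.encoding or "utf-8"
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
print()
```

The same pattern carries over to a web backend: instead of printing each chunk, forward it to the client as it arrives using your framework's streaming response mechanism.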