How to stream model output
Reduce time to first token for LLMs
For instructions on packaging and deploying a model with streaming output, see this Truss example. This guide covers how to call a model that has a streaming-capable endpoint.
Any model can be packaged with support for streaming output, but it only makes sense for models where:
- Generating a complete output takes a relatively long time.
- The first tokens of output are useful without the context of the rest of the output.
- Reducing the time to first token improves the user experience.
LLMs in chat applications are the perfect use case for streaming model output.
Example: Streaming with Mistral
Mistral 7B Instruct from Baseten’s model library is an LLM with streaming support. Invocation is the same for any other model library LLM, as well as for any Truss that follows the same standard.
Deploy Mistral 7B Instruct or a similar LLM to run the following examples.
Truss CLI
The Truss CLI has built-in support for streaming model output.
truss predict -d '{"prompt": "What is the Mistral wind?", "stream": true}'
API endpoint
When using a streaming endpoint with cURL, use the --no-buffer
flag to stream output as it is received.
As with all cURL invocations, you’ll need a model ID and API key.
curl -X POST https://app.baseten.co/models/MODEL_ID/predict \
-H 'Authorization: Api-Key YOUR_API_KEY' \
-d '{"prompt": "What is the Mistral wind?", "stream": true}' \
--no-buffer
Python application
Let’s take things a step further and look at how to integrate streaming output with a Python application.
import requests
import json
import os

# Model ID for production deployment
model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Open a session to enable streaming
s = requests.Session()

with s.post(
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    # Include "stream": True in the data dict so the model knows to stream
    data=json.dumps({
        "prompt": "What even is AGI?",
        "stream": True,
        "max_new_tokens": 4096
    }),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as resp:
    # Print the generated tokens as they get streamed
    for content in resp.iter_content():
        print(content.decode("utf-8"), end="", flush=True)
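To integrate this into a larger application, the same request can be wrapped in a generator that yields text chunks as they arrive. The sketch below is illustrative, not part of Baseten’s API: the stream_completion helper, its parameters, and the chunk_size=None choice are assumptions layered on top of the call shown above.

import requests
import json
import os

def stream_completion(prompt: str, model_id: str, max_new_tokens: int = 512):
    # Hypothetical helper: yields generated text chunks as they arrive
    api_key = os.environ["BASETEN_API_KEY"]
    with requests.post(
        f"https://model-{model_id}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {api_key}"},
        data=json.dumps({
            "prompt": prompt,
            "stream": True,
            "max_new_tokens": max_new_tokens,
        }),
        stream=True,
    ) as resp:
        resp.raise_for_status()
        # chunk_size=None yields data as soon as it is received rather than
        # buffering to a fixed size, which keeps time to first token low
        for chunk in resp.iter_content(chunk_size=None):
            yield chunk.decode("utf-8")

# Example usage: print tokens as they stream in
for text in stream_completion("What even is AGI?", model_id=""):
    print(text, end="", flush=True)

Decoding chunk by chunk mirrors the loop above; for production use, consider incremental UTF-8 decoding and error handling for dropped connections.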