For instructions on packaging and deploying a model with streaming output, see this Truss example. This guide covers how to call a model that has a streaming-capable endpoint.

Any model could be packaged with support for streaming output, but it only makes sense to do so for models where:

  • Generating a complete output takes a relatively long time.
  • The first tokens of output are useful without the context of the rest of the output.
  • Reducing the time to first token improves the user experience.

LLMs in chat applications are the perfect use case for streaming model output.

Example: Streaming with Mistral

Mistral 7B Instruct from Baseten’s model library is a recent LLM with streaming support. Invocation is the same for any other model library LLM, as well as for any Truss that follows the same standard.

Deploy Mistral 7B Instruct or a similar LLM to run the following examples.

Truss CLI

The Truss CLI has built-in support for streaming model output.

truss predict -d '{"prompt": "What is the Mistral wind?", "stream": true}'

API endpoint

When using a streaming endpoint with cURL, use the --no-buffer flag to stream output as it is received.

As with all cURL invocations, you’ll need a model ID and API key.

curl -X POST https://app.baseten.co/models/MODEL_ID/predict \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -d '{"prompt": "What is the Mistral wind?", "stream": true}' \
  --no-buffer

Python application

Let’s take things a step further and look at how to integrate streaming output with a Python application.

import json
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

prompt = "What is the Mistral wind?"

# Open session to enable streaming
s = requests.Session()
with s.post(
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    # Include "stream": True in the data dict so the model knows to stream
    data=json.dumps({"prompt": prompt, "stream": True, "max_new_tokens": 4096}),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as resp:
    # Pass chunk_size=1 to iter_content() to process output as soon as it arrives
    for token in resp.iter_content(1):
        token = token.decode("utf-8")
        # Do something here with each token
        print(token)
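
If you want to reuse the streaming call across an application, you can wrap the request in a generator. The sketch below is an illustration, not part of the Baseten API: the helper name stream_completion is made up for this example, and it assumes the endpoint streams plain UTF-8 text. It uses an incremental decoder so that multi-byte characters split across chunks are decoded correctly.

import codecs
import json
import os

import requests

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

def stream_completion(prompt):
    # Yield decoded text from the model as it arrives
    decoder = codecs.getincrementaldecoder("utf-8")()
    with requests.post(
        f"https://model-{model_id}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
        data=json.dumps({"prompt": prompt, "stream": True, "max_new_tokens": 4096}),
        stream=True,
    ) as resp:
        for chunk in resp.iter_content(chunk_size=1):
            # The incremental decoder buffers partial multi-byte characters
            text = decoder.decode(chunk)
            if text:
                yield text

# Example usage: print tokens as they arrive and accumulate the full response
full_response = ""
for token in stream_completion("What is the Mistral wind?"):
    print(token, end="", flush=True)
    full_response += token

This keeps the streaming logic in one place, so a chat UI or any other consumer only has to iterate over the generator.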