# Streaming

> How to call a model that has a streaming-capable endpoint.

<Note>
  Streaming works for both [Model APIs](/inference/model-apis/overview) (hosted, OpenAI-compatible endpoints) and self-deployed models packaged with Truss. The same patterns apply to both. Only the base URL and model slug differ.
</Note>

Any model can be packaged to stream its output, but streaming is only worthwhile when:

* Generating a complete output takes a relatively long time.
* The first tokens of output are useful without the context of the rest of the output.
* Reducing the time to first token improves the user experience.

LLMs in chat applications meet all three conditions, which makes them a natural fit for streaming model output.
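To make the time-to-first-token point concrete, here is a minimal sketch that times any iterable of text chunks. `measure_stream` and `fake_model_stream` are hypothetical names introduced for illustration, and the 50 ms delay is an arbitrary stand-in for generation time, not a measurement of a real model.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a stream of text chunks and time it.

    Returns (seconds until the first chunk, total seconds, full text).
    """
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    # If the stream was empty, time to first chunk equals the total time
    ttft = (first_chunk_at if first_chunk_at is not None else end) - start
    return ttft, end - start, "".join(parts)


def fake_model_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model: emits one word every 50 ms."""
    for word in ["Streaming ", "shows ", "output ", "early."]:
        time.sleep(0.05)
        yield word


if __name__ == "__main__":
    ttft, total, text = measure_stream(fake_model_stream())
    print(f"first chunk after {ttft:.2f}s, complete after {total:.2f}s: {text}")
```

Without streaming, the user would wait the full duration before seeing anything. With streaming, the wait shrinks to the time to first chunk.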

## Example: Streaming with Mistral

[Mistral 7B Instruct](https://www.baseten.co/library/mistral-7b-instruct) from Baseten's model library is an LLM with streaming support. The same invocation should work for any other LLM in the model library, as well as for any Truss that follows the same standard.

[Deploy Mistral 7B Instruct](https://www.baseten.co/library/mistral-7b-instruct) or a similar LLM to run the following examples.

### Truss CLI

The Truss CLI has built-in support for streaming model output.

```sh theme={"system"}
truss predict -d '{"prompt": "What is the Mistral wind?", "stream": true}'
```

### API endpoint

When using a streaming endpoint with cURL, use the `--no-buffer` flag to stream output as it is received.

As with all cURL invocations, you'll need a model ID and API key.

```sh theme={"system"}
curl -X POST https://model-MODEL_ID.api.baseten.co/production/predict \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -d '{"prompt": "What is the Mistral wind?", "stream": true}' \
  --no-buffer
```

### Python application

Let's take things a step further and look at how to integrate streaming output with a Python application.

```python theme={"system"}
import requests
import json
import os

# Model ID for production deployment
model_id = ""
# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# A session is optional; stream=True on the request is what enables streaming
s = requests.Session()
with s.post(
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    # Include "stream": True in the data dict so the model knows to stream
    data=json.dumps({
      "prompt": "What even is AGI?",
      "stream": True,
      "max_new_tokens": 4096
    }),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as resp:
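    resp.raise_for_status()
    # Decode as UTF-8 so multi-byte characters split across chunks print correctly
    resp.encoding = "utf-8"
    # Print the generated text as it is streamed
    for content in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(content, end="", flush=True)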
```
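When streamed bytes are decoded by hand, a multi-byte UTF-8 character (an accented letter or an emoji, for example) can arrive split across two chunks, and decoding each chunk on its own then fails. The sketch below shows one way to handle this with the standard library's incremental decoder. `decode_utf8_stream` is a hypothetical helper introduced for illustration, not part of Baseten's or `requests`' API.

```python
import codecs
from typing import Iterable, Iterator


def decode_utf8_stream(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text from byte chunks, buffering incomplete UTF-8 sequences."""
    # The incremental decoder holds back trailing bytes that do not yet
    # form a complete character and prepends them to the next chunk.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush anything left over once the stream ends
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


if __name__ == "__main__":
    # "é" is two bytes (0xC3 0xA9), deliberately split across chunks here
    chunks = [b"caf", b"\xc3", b"\xa9", b"!"]
    for text in decode_utf8_stream(chunks):
        print(text, end="", flush=True)
    print()
```

With the response object from the example above, this could be used as `decode_utf8_stream(resp.iter_content(chunk_size=None))`, assuming the model emits UTF-8 text.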
