Skip to main content
Streaming refers to returning a model’s output incrementally, token by token, as it is generated, rather than holding the response until generation finishes. The caller reads the output as it builds, so the first tokens arrive after the time to first token (TTFT) instead of after the entire response. Baseten supports streaming across a range of inference surfaces: Model APIs (hosted, OpenAI- and Anthropic-compatible endpoints), BIS-LLM, and dedicated deployments of models packaged with Truss. Custom Docker containers that expose an OpenAI-compatible API, such as vLLM and SGLang, stream the same way. Use streaming when:
  • Generating the complete output takes a relatively long time.
  • The first tokens are useful without the rest of the output.
  • Reducing the time to first token improves the user experience.
Chat applications backed by LLMs are the clearest example.

Enable streaming

Streaming is a per-request flag: set it on your call, then read the response as it arrives. The flag is the same everywhere; only the base URL and model slug differ.
# Self-deployed Truss model: stream from the model's predict endpoint
import os
import requests

model_id = "YOUR_MODEL_ID"

with requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Bearer {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "Write a haiku about the ocean.", "stream": True},
    stream=True,
) as resp:
    for chunk in resp.iter_content():
        print(chunk.decode("utf-8"), end="", flush=True)
Streaming changes when the caller sees output, not how much the model produces. The following diagram puts both delivery modes on one clock. The top lane streams: after a short prefill, tokens fill in one at a time from the first-token mark (TTFT). The bottom lane is non-streaming: it stays empty through the same generation, then the whole response lands at once at the end. Both finish together, so the only difference is when the caller first sees output. Token timing here is illustrative, not a measured latency.