Streaming
Return model output token by token as it is generated.
Streaming refers to returning a model’s output incrementally, token by token, as it is generated, rather than holding the response until generation finishes. The caller reads the output as it builds, so the first tokens arrive after the time to first token (TTFT) instead of after the entire response.
Baseten supports streaming across a range of inference surfaces: Model APIs (hosted, OpenAI- and Anthropic-compatible endpoints), BIS-LLM, and dedicated deployments of models packaged with Truss. Custom Docker containers that expose an OpenAI-compatible API, such as vLLM and SGLang, stream the same way.
Use streaming when:
- Generating the complete output takes a relatively long time.
- The first tokens are useful without the rest of the output.
- Reducing the time to first token improves the user experience.
Streaming changes when the caller sees output, not how much the model produces. The following diagram puts both delivery modes on one clock.
The top lane streams: after a short prefill, tokens fill in one at a time from the first-token mark (TTFT). The bottom lane is non-streaming: it stays empty through the same generation, then the whole response lands at once at the end. Both finish together, so the only difference is when the caller first sees output. Token timing here is illustrative, not a measured latency.
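The comparison between the two delivery modes can also be sketched numerically. The following is a minimal model, not a Baseten API: the names `ttft_s`, `inter_token_s`, `streaming_arrivals`, and `non_streaming_arrivals` are hypothetical, and the constant gap between tokens is an assumption made for illustration. It computes when the caller sees each token in each mode.

```python
from typing import List


def streaming_arrivals(n_tokens: int, ttft_s: float, inter_token_s: float) -> List[float]:
    """Arrival time of each token when streaming.

    Assumes the first token arrives at ttft_s and each later token follows
    after a constant inter_token_s gap (an illustrative simplification).
    """
    return [ttft_s + i * inter_token_s for i in range(n_tokens)]


def non_streaming_arrivals(n_tokens: int, ttft_s: float, inter_token_s: float) -> List[float]:
    """Arrival time of each token when the response is held until generation ends.

    Generation takes the same time as in the streaming case, but every token
    reaches the caller together, at the moment the last token is generated.
    """
    generated = streaming_arrivals(n_tokens, ttft_s, inter_token_s)
    if not generated:
        return []
    finish = generated[-1]
    return [finish] * n_tokens


# Illustrative numbers only, not measured latencies.
streamed = streaming_arrivals(n_tokens=5, ttft_s=0.5, inter_token_s=0.25)
held = non_streaming_arrivals(n_tokens=5, ttft_s=0.5, inter_token_s=0.25)

print("streaming first/last token at:", streamed[0], streamed[-1])
print("non-streaming first/last token at:", held[0], held[-1])
```

In this model both modes deliver the last token at the same time. Only the time at which the caller first sees output differs: `ttft_s` when streaming, and the full generation time otherwise.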