AIStackWatch

Streaming LLM Responses

Streaming means the provider sends tokens down the wire as the model emits them, rather than buffering the full response before replying. For user-facing products it is table stakes: a 10-second response whose first tokens appear after 400ms feels responsive, while the same response held back for the full 10 seconds feels broken.

The transport

Nearly every provider uses Server-Sent Events (SSE) over HTTP. Each event carries a small JSON payload with a delta. A simplified, Anthropic-style stream looks like this:

data: {"type": "content_block_delta", "delta": {"text": "Hel"}}
data: {"type": "content_block_delta", "delta": {"text": "lo"}}
data: {"type": "message_stop"}

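Your client concatenates the deltas into the final response. Tool use also streams, usually as a sequence of deltas carrying fragments of the tool call's JSON arguments (Anthropic's input_json_delta deltas, for example). The arguments only become valid JSON once the last fragment has arrived.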
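
The concatenation step can be sketched in a few lines of Python. This is a minimal illustration for the simplified event shape shown above, not a full SSE parser: it assumes one JSON payload per data: line and ignores multi-line data fields. The function names iter_sse_data and collect_text are made up for this example.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line, skipping everything else."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, `event:` lines
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate text deltas until a message_stop event arrives."""
    parts = []
    for event in iter_sse_data(lines):
        if event.get("type") == "content_block_delta":
            parts.append(event.get("delta", {}).get("text", ""))
        elif event.get("type") == "message_stop":
            break
    return "".join(parts)


if __name__ == "__main__":
    stream = [
        'data: {"type": "content_block_delta", "delta": {"text": "Hel"}}',
        'data: {"type": "content_block_delta", "delta": {"text": "lo"}}',
        'data: {"type": "message_stop"}',
    ]
    print(collect_text(stream))  # prints: Hello
```

In a real client the lines would come from the HTTP response body as it arrives, and a UI would render each delta immediately instead of waiting for the join.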

Why it matters beyond UX

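  • Cancel-on-change. If the user edits the input while a response is mid-stream, abort the request and stop generating output tokens that nobody will read. How much this saves depends on how often users interrupt; on chat apps it is often 30% or more of output cost.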
  • Early parsing. You can parse structured output as it arrives and start downstream work sooner.
  • Progress signals. Spinners feel honest when tokens are visibly appearing.
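
Cancel-on-change can be sketched as a consumer loop that checks a cancellation flag between tokens and closes the stream as soon as the flag is set. Everything here is illustrative: fake_token_stream stands in for a provider stream, and consume_until_cancelled is a hypothetical helper. With a real SDK, the assumption is that closing the underlying response or connection is what stops generation; whether billing stops at that point depends on the provider.

```python
import threading
from typing import Callable, Iterable, Iterator, List


def fake_token_stream(tokens: Iterable[str], produced: List[str]) -> Iterator[str]:
    """Stand-in for a provider stream; records which tokens were actually generated."""
    for tok in tokens:
        produced.append(tok)
        yield tok


def consume_until_cancelled(
    stream: Iterator[str],
    cancelled: threading.Event,
    on_token: Callable[[str], None],
) -> str:
    """Render tokens as they arrive; stop and close the stream once cancelled is set."""
    received = []
    try:
        for tok in stream:
            received.append(tok)
            on_token(tok)
            if cancelled.is_set():
                break  # the user changed the input, so stop pulling tokens
    finally:
        close = getattr(stream, "close", None)
        if callable(close):
            close()  # for a real HTTP stream, this is where the connection is released
    return "".join(received)
```

Because the generator is only advanced when the consumer asks for the next token, breaking out of the loop means the remaining tokens are never produced in this sketch. A networked stream behaves differently: the server keeps generating until it notices the closed connection.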

Implementation hazards

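  • Proxy buffering. Nginx buffers proxied responses by default, and CDNs and corporate proxies can do the same, so the client receives the full response in one shot. Send the X-Accel-Buffering: no response header (honored by Nginx), use HTTP/2, and test from real networks.
  • Non-streaming middleware. Middleware that reads the full response body before passing it on (request logging, for example) turns a stream back into a buffered response. Make it stream-aware or bypass it on streaming routes.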
  • JSON parsing mid-stream. Tool-use streams partial JSON that is syntactically invalid until complete. Use a tolerant parser (partial-json or equivalent) if you want live rendering.
  • Error handling. Errors can arrive mid-stream: a connection drop, a content-filter trigger, token budget exhaustion. The UI must handle "response partially done, then an error."
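
For the partial-JSON hazard, the idea behind a tolerant parser can be sketched by hand: track which strings and brackets are still open, append the missing closers, and try to parse the result. This is a best-effort illustration, not a replacement for a library such as partial-json. The function name close_partial_json is made up for this example, and it returns None whenever the fragment cannot be repaired yet (for instance when it ends in the middle of a key).

```python
import json
from typing import Any, Optional


def close_partial_json(fragment: str) -> Optional[Any]:
    """Best-effort parse of an incomplete JSON fragment; None if it cannot be closed yet."""
    closers = []        # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    repaired = fragment
    if escaped:
        repaired = repaired[:-1]  # drop a dangling backslash at the cut point
    if in_string:
        repaired += '"'           # close the string that was cut off
    repaired += "".join(reversed(closers))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None               # e.g. cut mid-key or after a comma; wait for more data
```

A live-rendering UI would call this on the accumulated buffer after each delta and keep showing the last successful result whenever it returns None.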

Providers

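OpenAI, Anthropic, xAI, Google, Together, Fireworks all support SSE streaming. The shapes are similar in concept (a sequence of small JSON deltas) but differ in detail: OpenAI's chat completions stream puts text under choices[0].delta.content and ends with data: [DONE], while Anthropic uses typed events such as content_block_delta and message_stop. Client SDKs abstract the event loop; you rarely touch the raw stream.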
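
If you do handle raw chunks from more than one provider, field names differ even where the overall shape is close, so a thin adapter keeps the rest of the code provider-agnostic. The sketch below covers only two simplified shapes, an Anthropic-style content_block_delta and an OpenAI-style chat completion chunk, and the function name extract_text_delta is hypothetical. Real chunks carry more fields, and the shapes should be checked against each provider's current documentation.

```python
from typing import Optional


def extract_text_delta(chunk: dict) -> Optional[str]:
    """Return the text fragment carried by a decoded stream chunk, or None if there is none."""
    # Anthropic-style: {"type": "content_block_delta", "delta": {"text": "..."}}
    if chunk.get("type") == "content_block_delta":
        return chunk.get("delta", {}).get("text")

    # OpenAI-style chat completion chunk: {"choices": [{"delta": {"content": "..."}}]}
    choices = chunk.get("choices")
    if choices:
        return choices[0].get("delta", {}).get("content")

    return None  # stop events, usage reports, and anything unrecognized
```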

When NOT to stream

  • Background jobs. If no human is waiting, streaming adds complexity with no benefit. Use the regular endpoint.
  • Short structured outputs. A 50-token classification response typically finishes in about half a second; streaming saves nothing and complicates parsing.
  • Embeddings. They're not generative and don't stream anyway.
  • Strict end-to-end validation. If you must validate the full structured output before showing anything, buffer on your side and show a spinner.
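
The buffer-then-validate case can be sketched as follows. The schema is invented for illustration: REQUIRED_KEYS and the label/confidence fields are assumptions, not part of any provider API. The point is only that nothing reaches the caller until the whole payload has been parsed and checked.

```python
import json
from typing import Iterable

# Hypothetical schema for a classification response.
REQUIRED_KEYS = {"label", "confidence"}


def buffer_and_validate(chunks: Iterable[str]) -> dict:
    """Collect the whole stream, then parse and validate before returning anything."""
    text = "".join(chunks)   # the UI shows a spinner while this accumulates
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

The same function works whether the chunks come from a streaming endpoint or from a single buffered response, which makes it easy to switch transports later.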

Rule of thumb: stream user-facing conversational responses, buffer machine-to-machine responses.