What is LLM Streaming?

How chat UIs hide multi-second response times behind typing indicators.

Short answer

LLM streaming is the technique of returning tokens from an LLM API as they are generated rather than waiting for the complete response. Implemented via server-sent events (SSE) or WebSocket, streaming reduces perceived latency from seconds to under 500ms by letting the UI show text as the model types. It is the default mode for chat interfaces like ChatGPT, Claude, and agentic platforms including Tycoon.

In depth

LLMs generate output autoregressively, one token at a time, each dependent on the previous. For a 500-token response at 50 tokens/second, total generation time is 10 seconds. Without streaming, the user stares at 10 seconds of dead time. With streaming, the first token arrives in 200-500ms (time to first token, TTFT) and text appears progressively. Perceived latency drops from 'slow' to 'fast' even though total compute is identical.

The transport is usually server-sent events (SSE), a simple protocol over plain HTTP in which the server keeps the connection open and sends typed data chunks until done. OpenAI, Anthropic, and most other providers expose streaming via SSE with a structured event format: tokens, tool-call deltas, usage metadata, and a final 'done' event. WebSocket is used for bi-directional streaming (voice, multi-agent), but pure generation streaming is almost always SSE because it's simpler to scale.

Two latency metrics matter. Time to first token (TTFT) is how long until the user sees the first character, dominated by prompt processing and queue time: typically 100-500ms for short prompts, 1-3 seconds for 50K+ token prompts. Inter-token latency (ITL) is the gap between subsequent tokens, determined by the model's throughput: typically 20-200 tokens/second depending on model size and load. Total response time = TTFT + (tokens × ITL). For chat UIs, TTFT is the dominant UX factor; users tolerate long total responses if the first word comes quickly.

Streaming has production gotchas. (1) Cloud platforms idle-kill long SSE connections: Cloudflare, Cloud Run, and many load balancers close connections after 30-120 seconds of idle, so in agentic flows where the LLM pauses to call tools, streams can die mid-response. Mitigation: send keepalive heartbeats every 10-15 seconds. (2) Proxies buffer: nginx, haproxy, and some CDNs buffer SSE responses by default, defeating streaming entirely. Mitigation: set `proxy_buffering off` explicitly. (3) Mobile networks drop connections: clients need robust reconnection logic with resumption from the last received token. (4) Structured outputs are hard to stream: JSON mode streams partial JSON that's invalid until the final token, which breaks naive parsing, so streaming JSON parsers (ijson, partial-json) or incremental validators are needed.

Streaming with tool use adds complexity. The model emits tool-call deltas interleaved with text. Clients must parse the event stream to distinguish 'speaking to user' from 'calling a tool,' buffer tool-call arguments until complete, execute the tool, and feed results back. OpenAI, Anthropic, and the Vercel AI SDK all ship helper libraries that handle this plumbing. Rolling your own is error-prone; use the SDK.

For AI agents, streaming is table-stakes UX. Tycoon streams every Astra response over SSE so the user sees thinking as it happens. The platform explicitly hardens against Cloud Run idle timeouts: an incident in early 2026 caused streaming responses to vanish mid-reply when tool execution paused the stream too long. The current production architecture uses a sync fallback for reliability (noted in the repo's CLAUDE.md) while streaming improvements continue. This is a common tradeoff: streaming is the better UX when it works; sync is more reliable.
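The latency arithmetic above can be sanity-checked in a few lines, using the illustrative numbers from this section:

```python
def total_response_time(ttft_s: float, n_tokens: int, tokens_per_s: float) -> float:
    """Total response time = TTFT + (tokens x inter-token latency)."""
    return ttft_s + n_tokens / tokens_per_s

# The 500-token example: 50 tokens/second, 300 ms TTFT.
total = total_response_time(0.3, 500, 50)
print(total)  # 10.3 seconds either way; streaming just paints the first word at 0.3 s
```

The total is identical with or without streaming; only the moment the user first sees output changes.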

Examples

  • ChatGPT's typing effect — text appearing word by word — is SSE streaming from the OpenAI API
  • Claude.ai and Claude Code — every response is streamed, including thinking blocks in reasoning mode
  • Perplexity's answer generation streams tokens as the model produces them, citations arriving in-line
  • Cursor and Windsurf stream code completions as the AI types — perceptible even at sub-100ms TTFT
  • OpenAI Python SDK: `client.chat.completions.create(stream=True)` returns an iterator of delta events
  • Anthropic SDK: `client.messages.stream(...)` yields MessageStream events with text, tool_use, and usage metadata
  • Vercel AI SDK 5+: `streamText({model, messages})` wraps SSE in a UI-ready data stream with structured parts
  • Tycoon streams Astra's replies over SSE, with pg_notify bridging across Cloud Run pods for multi-instance consistency
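Since most of the examples above are SSE under the hood, a minimal parser for the wire format helps demystify them. This sketch handles only the `data:` field and comment lines, not the full spec (`event:`, `id:`, `retry:` are omitted):

```python
def parse_sse(raw: str) -> list[str]:
    """Split a text/event-stream payload into each event's data.

    Events are separated by a blank line; lines starting with ':' are
    comments (often used as keepalives) and carry no data.
    """
    events = []
    for block in raw.split("\n\n"):
        data = [
            line[len("data:"):].removeprefix(" ")  # spec strips one leading space
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        if data:
            events.append("\n".join(data))
    return events

stream = 'data: {"delta": "Hel"}\n\ndata: {"delta": "lo"}\n\n: keepalive\n\ndata: [DONE]\n\n'
print(parse_sse(stream))  # ['{"delta": "Hel"}', '{"delta": "lo"}', '[DONE]']
```

Note the keepalive comment is silently dropped, which is exactly why it is safe for servers to emit during silent periods.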

Frequently asked questions

Why do I need streaming if the total time is the same?

Because human perception of latency is dominated by the first response, not total duration. A 10-second response where text appears in 500ms feels snappy; the same 10-second response with no output until the end feels broken. UX research consistently shows users judge streaming chat as 'fast' and non-streaming as 'slow' even for identical total times. There's also a practical benefit: users can interrupt a streaming response if it's heading the wrong way, saving compute and time. For any user-facing LLM app, streaming should be the default.

How do I stream from OpenAI or Anthropic?

Both providers expose streaming in their SDKs. OpenAI: `client.chat.completions.create(..., stream=True)` returns an iterator of ChatCompletionChunk objects. Anthropic: `client.messages.stream(...)` returns a context manager yielding MessageStream events. For browsers, use Vercel AI SDK or a similar framework that handles the SSE plumbing — manually parsing SSE in JavaScript is doable but tedious. For production, always handle connection drops with retry logic, send keepalives, and disable buffering on any reverse proxy in the path.

Does streaming work with structured outputs like JSON?

It works, but parsing is tricky because you receive partial JSON that's invalid until the last character. Three solutions: (1) Don't parse incrementally; buffer until the stream ends, then parse once. Simple but loses the streaming UX for structured data. (2) Use a partial/streaming JSON parser (ijson in Python, partial-json in JavaScript) that tolerates in-progress JSON. (3) Use a streaming schema library like Vercel AI SDK's useObject which handles this natively. For most use cases where the JSON feeds a UI progressively (form fields populating, tables growing), option 3 is the best. For pure API-to-API calls, option 1 is fine.
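Option 1 (buffer, then parse once) is a few lines. The same chunks show why naive incremental parsing fails; the chunk boundaries and field names here are made up for illustration:

```python
import json

def buffered_json(chunks) -> dict:
    """Option 1: accumulate the streamed JSON, parse once at the end."""
    return json.loads("".join(chunks))

chunks = ['{"name": "As', 'tra", "ready', '": true}']

# Each prefix alone is invalid JSON, so incremental json.loads raises
# until the final chunk arrives:
try:
    json.loads(chunks[0])
except json.JSONDecodeError:
    pass  # expected

print(buffered_json(chunks))  # {'name': 'Astra', 'ready': True}
```

Options 2 and 3 exist precisely to avoid that buffering delay when the partial object should drive the UI.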

Why do my streams die on Cloud Run or Vercel?

Default idle timeouts. Cloud Run by default closes HTTP connections after 300 seconds; serverless platforms are even shorter. Internal load balancers and CDNs in the path may close faster. If your LLM call includes tool execution that pauses the stream for 30+ seconds, the platform may kill the connection as idle. Mitigations: (1) Send SSE keepalive comments (`: keepalive\n\n`) every 10-15 seconds during silent periods. (2) Configure the platform's idle timeout to the max your plan allows. (3) For truly long responses with long pauses, consider an async pattern — client subscribes to a queue (Redis, Postgres LISTEN/NOTIFY) rather than holding an HTTP connection. Tycoon uses Postgres NOTIFY for cross-pod streaming precisely because Cloud Run HTTP connections are fragile.
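The heartbeat mitigation can be sketched as a server-side generator that emits an SSE comment whenever the token source goes quiet. The queue-based handoff and the `None` end-of-stream sentinel are assumptions about your server's internals:

```python
import queue

def sse_frames(tokens: "queue.Queue", heartbeat_s: float = 10.0):
    """Yield SSE frames, inserting a comment heartbeat during silent periods
    (e.g. while a tool call pauses generation) so idle-timeout proxies
    don't kill the connection. A None token signals end of stream."""
    while True:
        try:
            tok = tokens.get(timeout=heartbeat_s)
        except queue.Empty:
            yield ": keepalive\n\n"  # SSE comment line; clients ignore it
            continue
        if tok is None:
            yield "data: [DONE]\n\n"
            return
        yield f"data: {tok}\n\n"

q = queue.Queue()
for t in ["Hel", "lo", None]:
    q.put(t)
print(list(sse_frames(q)))  # ['data: Hel\n\n', 'data: lo\n\n', 'data: [DONE]\n\n']
```

In a real deployment this generator would be wired to the web framework's streaming response, and `heartbeat_s` must be comfortably below the platform's idle timeout.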

Can I stream chain-of-thought or reasoning tokens separately?

Yes, and the APIs increasingly surface this. Anthropic's Claude 4.5 thinking mode streams thinking content in distinct events (typed as 'thinking' blocks) alongside the final response. OpenAI's o1/GPT-5 thinking hides internal reasoning by default but exposes reasoning tokens in the usage metadata and optionally as a summary. DeepSeek R1 exposes full reasoning traces inline. For UI, most products show a collapsible 'thinking' panel that renders the thinking stream separately from the final answer — this matches user mental models and avoids confusing the reasoning with the actual reply. Structured stream events make this parsing straightforward.
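Routing a typed event stream into a collapsible thinking panel versus the visible answer is a small fold. The `(type, text)` tuple shape is a simplification of the providers' actual event objects:

```python
def split_thinking(events):
    """Separate 'thinking' events from answer text in a typed event stream."""
    thinking, answer = [], []
    for etype, text in events:
        (thinking if etype == "thinking" else answer).append(text)
    return "".join(thinking), "".join(answer)

events = [("thinking", "User wants a summary. "), ("thinking", "Keep it short."),
          ("text", "Here's the gist: "), ("text", "streaming hides latency.")]
print(split_thinking(events))
```

A real client would render the two accumulators into separate UI regions as they grow, rather than waiting for the stream to finish.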
