What is LLM Streaming?
How chat UIs hide multi-second response times behind typing indicators.
LLM streaming is the technique of returning tokens from an LLM API as they are generated rather than waiting for the complete response. Implemented via server-sent events (SSE) or WebSockets, streaming reduces perceived latency from seconds to under 500ms by letting the UI show text as the model generates it. It is the default mode for chat interfaces like ChatGPT, Claude, and agentic platforms including Tycoon.
Updated Apr 2026
In depth
LLMs generate output autoregressively — one token at a time, each dependent on the previous. For a 500-token response at 50 tokens/second, total time is 10 seconds. Without streaming, the user waits 10 seconds of dead time. With streaming, the first token arrives in 200-500ms (time to first token, TTFT) and text appears progressively. Perceived latency drops from 'slow' to 'fast' even though total compute is identical.
The transport is usually server-sent events (SSE), a simple protocol layered on plain HTTP in which the server keeps the connection open and sends a sequence of typed text events until the response is done. OpenAI, Anthropic, and most other providers expose streaming via SSE with a structured event format: tokens, tool-call deltas, usage metadata, and a final event that marks the end of the stream. WebSockets are used for bi-directional streaming (voice, multi-agent), but pure generation streaming is almost always SSE because it's simpler to scale.
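To make the wire format concrete, here is a minimal sketch of an SSE parser. It assumes the stream has already been decoded into text lines, and the event names `token` and `done` are hypothetical placeholders rather than any provider's actual schema.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of decoded text lines."""
    event_type = "message"  # default event type in the SSE format
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield {"event": event_type, "data": "\n".join(data_parts)}
            event_type, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keepalive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format strips one leading space
            if field == "event":
                event_type = value
            elif field == "data":
                data_parts.append(value)
            # Other fields (id, retry) are ignored in this sketch.


# Hypothetical stream: two token events, a keepalive comment, and an end marker.
sample = [
    "event: token", "data: Hel", "",
    ": keepalive",
    "event: token", "data: lo", "",
    "event: done", "data: {}", "",
]
events = list(parse_sse(sample))
text = "".join(e["data"] for e in events if e["event"] == "token")
```

A real client would usually rely on the provider SDK or an existing SSE library; the sketch only shows why the format is easy to produce and consume.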
Two latency metrics matter. Time to first token (TTFT) is how long until the user sees the first character; it is dominated by prompt processing and queue time. Typical ranges: 100-500ms for short prompts, 1-3 seconds for 50K+ token prompts. Inter-token latency (ITL) is the gap between subsequent tokens; it is the inverse of the model's throughput, which typically runs 20-200 tokens/second depending on model size and load, so ITL is roughly 5-50ms. Total response time = TTFT + (tokens × ITL). For chat UIs, TTFT is the dominant UX factor; users tolerate long total responses if the first word comes quickly.
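The total-time formula can be checked with a few lines of arithmetic. This sketch treats ITL as seconds per token, that is, the reciprocal of throughput, and plugs in illustrative numbers (0.3 s TTFT, 500 tokens, 50 tokens/second); the function name is an assumption made for the example.

```python
def total_response_seconds(ttft_s: float, tokens: int, tokens_per_second: float) -> float:
    """Total response time = TTFT + tokens * ITL, with ITL = 1 / throughput."""
    itl_s = 1.0 / tokens_per_second  # seconds between consecutive tokens
    return ttft_s + tokens * itl_s


ttft = 0.3  # illustrative time to first token, in seconds
total = total_response_seconds(ttft, tokens=500, tokens_per_second=50.0)

# Without streaming the user waits for `total` (about 10.3 s) before seeing anything;
# with streaming the first text appears after `ttft` (0.3 s).
print(f"first text after {ttft:.1f}s, full response after {total:.1f}s")
```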
Streaming has production gotchas. (1) Cloud platforms often kill idle SSE connections. Cloudflare, Cloud Run, and many load balancers close connections after 30-120 seconds without traffic. For agentic flows where the LLM pauses to call tools, streams can die mid-response. Mitigation: send keepalive heartbeats every 10-15 seconds. (2) Proxies buffer. NGINX, HAProxy, and some CDNs buffer SSE responses by default, defeating streaming entirely. Mitigation: disable buffering explicitly (in NGINX, set proxy_buffering off). (3) Mobile networks drop connections. Clients need robust reconnection logic with resumption from the last received token. (4) Structured outputs are hard to stream. JSON mode streams partial JSON that's invalid until the final token, which breaks naive parsing. Streaming JSON parsers (ijson, partial-json) or incremental validators are needed.
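The heartbeat mitigation in (1) might look something like the following sketch. It wraps an upstream async stream of SSE chunks and emits an SSE comment line whenever nothing has arrived within the interval. The names `with_heartbeat` and `slow_model` are hypothetical, and the demo uses very short intervals so that it finishes quickly; a production value would be in the 10-15 second range mentioned above.

```python
import asyncio
from typing import AsyncIterator


async def _next_chunk(iterator: AsyncIterator[str]) -> str:
    """Await the next item; raises StopAsyncIteration when the source ends."""
    return await iterator.__anext__()


async def with_heartbeat(
    source: AsyncIterator[str], interval_s: float = 10.0
) -> AsyncIterator[str]:
    """Relay chunks from `source`, inserting an SSE comment during idle gaps."""
    iterator = source.__aiter__()
    pending = asyncio.ensure_future(_next_chunk(iterator))
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval_s)
            if not done:
                # Nothing arrived in time: send a comment line, which SSE
                # clients ignore but which keeps the connection active.
                yield ": keepalive\n\n"
                continue
            try:
                chunk = pending.result()
            except StopAsyncIteration:
                break  # upstream finished
            yield chunk
            pending = asyncio.ensure_future(_next_chunk(iterator))
    finally:
        pending.cancel()  # no-op if the task already completed


async def slow_model() -> AsyncIterator[str]:
    """Hypothetical upstream that goes quiet, as if waiting on a tool call."""
    yield "data: Hello\n\n"
    await asyncio.sleep(0.2)
    yield "data: world\n\n"


async def collect() -> list[str]:
    return [c async for c in with_heartbeat(slow_model(), interval_s=0.05)]


chunks = asyncio.run(collect())
```

The design choice here is to keep the pending read alive across timeouts instead of cancelling and restarting it, so no upstream chunk is lost when a heartbeat fires.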
Streaming with tool use adds complexity. The model emits tool-call deltas interleaved with text. Clients must parse the event stream to distinguish 'speaking to user' from 'calling a tool,' buffer tool-call arguments until complete, execute the tool, and feed results back. The OpenAI and Anthropic SDKs and the Vercel AI SDK all include helpers that handle this plumbing. Rolling your own is error-prone; use the SDK.
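To illustrate what that plumbing involves, here is a minimal sketch of the buffering step. The event shapes (`text_delta`, `tool_call_start`, `tool_call_delta`, `tool_call_end`) are hypothetical and deliberately simplified; they do not match any provider's actual schema, and feeding results back to the model is left out.

```python
import json
from typing import Callable, Iterable


def consume_stream(
    events: Iterable[dict], tools: dict[str, Callable[..., object]]
) -> tuple[str, dict[str, object]]:
    """Separate user-facing text from tool calls and run each completed call."""
    text_parts: list[str] = []
    pending: dict[str, dict] = {}      # call id -> name and argument fragments
    results: dict[str, object] = {}    # call id -> tool return value
    for ev in events:
        kind = ev["type"]
        if kind == "text_delta":
            text_parts.append(ev["text"])
        elif kind == "tool_call_start":
            pending[ev["id"]] = {"name": ev["name"], "fragments": []}
        elif kind == "tool_call_delta":
            # Argument JSON arrives in pieces and is invalid until complete.
            pending[ev["id"]]["fragments"].append(ev["fragment"])
        elif kind == "tool_call_end":
            call = pending.pop(ev["id"])
            args = json.loads("".join(call["fragments"]))  # parse only when whole
            results[ev["id"]] = tools[call["name"]](**args)
    return "".join(text_parts), results


# Hypothetical stream with text interleaved between argument fragments.
events = [
    {"type": "text_delta", "text": "Checking "},
    {"type": "tool_call_start", "id": "call_1", "name": "add"},
    {"type": "tool_call_delta", "id": "call_1", "fragment": '{"a": 2,'},
    {"type": "text_delta", "text": "the sum."},
    {"type": "tool_call_delta", "id": "call_1", "fragment": ' "b": 3}'},
    {"type": "tool_call_end", "id": "call_1"},
]
text, results = consume_stream(events, {"add": lambda a, b: a + b})
```

Even this toy version has to track per-call state and defer parsing, which is why the SDK helpers are usually the safer route.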
For AI agents, streaming is table-stakes UX. Tycoon streams every Astra response over SSE so the user sees thinking as it happens. The platform explicitly hardens against Cloud Run idle timeouts: an incident in early 2026 caused streaming responses to vanish mid-reply when tool execution paused the stream too long. The current production architecture uses a sync fallback for reliability (mentioned in the repo's CLAUDE.md) while streaming improvements continue. This is a common tradeoff: streaming is better UX when it works, sync is more reliable.