What is LLM Streaming?
How chat UIs hide multi-second response times behind typing indicators.
LLM streaming is the technique of returning tokens from an LLM API as they are generated rather than waiting for the complete response. Implemented via server-sent events (SSE) or WebSocket, streaming reduces perceived latency from seconds to under 500ms by letting the UI show text as the model types. It is the default mode for chat interfaces like ChatGPT, Claude, and agentic platforms including Tycoon.
In depth
Examples
- ChatGPT's typing effect — text appearing word by word — is SSE streaming from the OpenAI API
- Claude.ai and Claude Code — every response is streamed, including thinking blocks in reasoning mode
- Perplexity's answer generation streams tokens as the model produces them, citations arriving in-line
- Cursor and Windsurf stream code completions as the AI types — perceptible even at sub-100ms TTFT
- OpenAI Python SDK: `client.chat.completions.create(stream=True)` returns an iterator of delta events
- Anthropic SDK: `client.messages.stream(...)` yields MessageStream events with text, tool_use, and usage metadata
- Vercel AI SDK 5+: `streamText({model, messages})` wraps SSE in a UI-ready data stream with structured parts
- Tycoon streams Astra's replies over SSE, with pg_notify bridging across Cloud Run pods for multi-instance consistency
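The SSE wire format underlying these examples is simple enough to sketch: frames are separated by a blank line, `data:` lines carry payloads, and lines starting with `:` are comments. A minimal client-side parser, with an invented sample payload resembling an OpenAI-style delta stream:

```python
import json

def parse_sse(raw: str):
    """Split a raw SSE stream into its data payloads.

    Frames are separated by a blank line; each frame carries one or
    more `data:` lines. Lines starting with ':' are comments (often
    used as keepalives) and are ignored.
    """
    events = []
    for frame in raw.split("\n\n"):
        data_lines = [
            line[len("data:"):].strip()
            for line in frame.split("\n")
            if line.startswith("data:")
        ]
        if data_lines:
            events.append("\n".join(data_lines))
    return events

# Invented sample stream for illustration — real APIs send richer JSON.
raw = (
    'data: {"delta": "Hel"}\n\n'
    ": keepalive\n\n"
    'data: {"delta": "lo"}\n\n'
    "data: [DONE]\n\n"
)

text = ""
for payload in parse_sse(raw):
    if payload == "[DONE]":  # OpenAI's end-of-stream sentinel
        break
    text += json.loads(payload)["delta"]

print(text)  # → Hello
```

In production you would read frames incrementally from the network rather than from a string, but the framing rules are the same.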
Frequently asked questions
Why do I need streaming if the total time is the same?
Because human perception of latency is dominated by the first response, not total duration. A 10-second response where text appears in 500ms feels snappy; the same 10-second response with no output until the end feels broken. UX research consistently shows users judge streaming chat as 'fast' and non-streaming as 'slow' even for identical total times. There's also a practical benefit: users can interrupt a streaming response if it's heading the wrong way, saving compute and time. For any user-facing LLM app, streaming should be the default.
How do I stream from OpenAI or Anthropic?
Both providers expose streaming in their SDKs. OpenAI: `client.chat.completions.create(..., stream=True)` returns an iterator of ChatCompletionChunk objects. Anthropic: `client.messages.stream(...)` returns a context manager yielding MessageStream events. For browsers, use Vercel AI SDK or a similar framework that handles the SSE plumbing — manually parsing SSE in JavaScript is doable but tedious. For production, always handle connection drops with retry logic, send keepalives, and disable buffering on any reverse proxy in the path.
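On the server side, the SSE plumbing amounts to wrapping the SDK's token iterator in `data:` frames. A hedged sketch — the `delta` payload shape and the `[DONE]` sentinel follow OpenAI's convention, and the fake token list stands in for a real SDK stream:

```python
import json

def sse_frames(tokens):
    """Wrap an iterator of text deltas in SSE wire format.

    Each token becomes one `data:` frame; a final [DONE] sentinel
    tells the client the stream is complete.
    """
    for tok in tokens:
        yield f"data: {json.dumps({'delta': tok})}\n\n"
    yield "data: [DONE]\n\n"

# Fake model output standing in for e.g. an OpenAI chunk iterator.
frames = list(sse_frames(["Hel", "lo"]))
print(frames[0])  # → data: {"delta": "Hel"}
```

A real handler would yield these frames from a streaming HTTP response (with `Content-Type: text/event-stream` and buffering disabled), feeding tokens in as the SDK produces them.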
Does streaming work with structured outputs like JSON?
It works, but parsing is tricky because you receive partial JSON that's invalid until the last character. Three solutions: (1) Don't parse incrementally; buffer until the stream ends, then parse once. Simple but loses the streaming UX for structured data. (2) Use a partial/streaming JSON parser (ijson in Python, partial-json in JavaScript) that tolerates in-progress JSON. (3) Use a streaming schema library like Vercel AI SDK's useObject which handles this natively. For most use cases where the JSON feeds a UI progressively (form fields populating, tables growing), option 3 is the best. For pure API-to-API calls, option 1 is fine.
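Option 1 is short enough to show directly. A sketch with simulated fragments — the chunk boundaries are invented, but the pattern (accumulate, attempt a parse, succeed only once the JSON is complete) is the whole technique:

```python
import json

# Simulated stream of JSON fragments from a model; note that no
# individual fragment — or prefix — is valid JSON on its own.
chunks = ['{"name": "A', 'da", "age"', ': 30}']

buf = ""
parsed = None
for chunk in chunks:
    buf += chunk
    try:
        parsed = json.loads(buf)  # fails until the JSON is complete
    except json.JSONDecodeError:
        continue  # keep buffering

print(parsed)  # → {'name': 'Ada', 'age': 30}
```

Strictly, for option 1 you only need to parse once after the stream ends; attempting a parse per chunk as above is a cheap way to detect completion without a sentinel.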
Why do my streams die on Cloud Run or Vercel?
Default idle timeouts. Cloud Run by default closes HTTP connections after 300 seconds; serverless platforms are even shorter. Internal load balancers and CDNs in the path may close faster. If your LLM call includes tool execution that pauses the stream for 30+ seconds, the platform may kill the connection as idle. Mitigations: (1) Send SSE keepalive comments (`: keepalive\n\n`) every 10-15 seconds during silent periods. (2) Configure the platform's idle timeout to the max your plan allows. (3) For truly long responses with long pauses, consider an async pattern — client subscribes to a queue (Redis, Postgres LISTEN/NOTIFY) rather than holding an HTTP connection. Tycoon uses Postgres NOTIFY for cross-pod streaming precisely because Cloud Run HTTP connections are fragile.
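Mitigation (1) can be sketched deterministically. This hypothetical helper takes timestamped SSE frames and injects one keepalive comment per interval of silence — a real server would drive this from a timer alongside the LLM stream rather than from precomputed timestamps:

```python
def with_keepalives(events, interval=10.0):
    """Interleave SSE keepalive comments into a timestamped frame
    stream whenever the gap between frames exceeds `interval`
    seconds. `events` is a list of (timestamp, frame) pairs.
    """
    out = []
    last = 0.0
    for ts, frame in events:
        # Emit one comment per elapsed interval of silence, so the
        # connection never goes quiet long enough to be reaped.
        while ts - last > interval:
            out.append(": keepalive\n\n")
            last += interval
        out.append(frame)
        last = ts
    return out

# A 34-second silent gap (e.g. a slow tool call) gets bridged
# by three keepalives at the 10-second interval.
frames = with_keepalives(
    [(1.0, "data: a\n\n"), (35.0, "data: b\n\n")], interval=10.0
)
```

SSE clients ignore comment lines by spec, so keepalives are invisible to the application while still counting as traffic for idle-timeout purposes.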
Can I stream chain-of-thought or reasoning tokens separately?
Yes, and the APIs increasingly surface this. Anthropic's Claude 4.5 thinking mode streams thinking content in distinct events (typed as 'thinking' blocks) alongside the final response. OpenAI's o1/GPT-5 thinking hides internal reasoning by default but exposes reasoning tokens in the usage metadata and optionally as a summary. DeepSeek R1 exposes full reasoning traces inline. For UI, most products show a collapsible 'thinking' panel that renders the thinking stream separately from the final answer — this matches user mental models and avoids confusing the reasoning with the actual reply. Structured stream events make this parsing straightforward.
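The UI-side routing described above is a few lines once the stream events are typed. A sketch with synthetic events — the `type`/`text` shapes here are invented for illustration, loosely resembling Anthropic-style deltas:

```python
# Synthetic typed stream events; real SDKs emit richer objects.
events = [
    {"type": "thinking", "text": "User wants a summary. "},
    {"type": "thinking", "text": "Keep it short."},
    {"type": "text", "text": "Here is a short summary."},
]

panels = {"thinking": "", "text": ""}
for ev in events:
    # Route each delta to its own buffer so the UI can render a
    # collapsible thinking panel separately from the final answer.
    panels[ev["type"]] += ev["text"]

print(panels["text"])  # → Here is a short summary.
```

The key design point is that the separation happens at the event-type level, not by parsing the text, so reasoning content can never leak into the answer buffer.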