Learn

What is a Context Window?

The LLM's working memory — how much it can see at once.


Updated Apr 2026
Short answer

A context window is the maximum number of tokens an LLM can process in a single inference request — including the system prompt, conversation history, retrieved documents, tool outputs, and the generated response. It is the hard ceiling on how much information the model can 'see' at once, and ranges from 8K tokens (GPT-3.5 era) to 1M+ tokens (Gemini 2.5 Pro, 2025).

In depth

Context windows are a consequence of the transformer architecture. Every token the model processes attends to every other token in the window via the attention mechanism, so memory and compute cost scale quadratically with window length: doubling the window quadruples the compute. This is why context windows grew slowly from 2020 (GPT-3: 2K tokens) through 2023 (GPT-4: 8K, then 32K), then rapidly as architectural innovations (FlashAttention, Ring Attention, sparse attention) reduced the quadratic penalty. A token is roughly 3-4 characters or 0.75 words in English, so 1,000 tokens is about 750 words.

Current context windows in 2026 span a wide range. Claude 4.5 Sonnet: 200K tokens (~150K words, roughly a 500-page book). GPT-5: 400K tokens. Gemini 2.5 Pro: 1M tokens, with 2M in limited preview. Open-source Llama 3.3 70B: 128K. Specialized models like Magic.dev's LTM-2-Mini claim 100M tokens. The practical ceiling keeps rising and shows no sign of stabilizing.

What fills a context window in production:

  • System prompt: role definition, rules, output format; typically 500-5,000 tokens
  • Conversation history: previous user/assistant turns; grows over time
  • Retrieved documents (RAG): chunks pulled from a vector store; typically 2-30K tokens
  • Tool definitions and outputs: function schemas and results from tool calls; 1-50K tokens depending on the agent
  • User message: the current question
  • Generated response: the model's output also counts

All of these together must fit under the window ceiling.

Longer context windows unlocked real capabilities. At 100K+ tokens you could dump entire code repositories, whole PDF books, or full legal contracts into a single request and ask questions about them. At 1M tokens, entire video transcripts, years of Slack history, or complete codebases fit. But raw capacity isn't the same as effective capacity.
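As a rough sketch of the budget arithmetic above, the snippet below sums estimated token counts for each component and checks them against a window ceiling. The ~4-characters-per-token heuristic, the component strings, and the 4,096-token output reserve are illustrative assumptions; real counts come from the model's own tokenizer (e.g. tiktoken for OpenAI models).

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def fits_in_window(parts: dict[str, str], window: int,
                   reserve_for_output: int = 4096) -> bool:
    """Check that all prompt components, plus a reserved output budget,
    fit under the model's context window."""
    used = sum(estimate_tokens(t) for t in parts.values())
    return used + reserve_for_output <= window

# Hypothetical request composition, mirroring the list above
request = {
    "system_prompt": "You are a helpful assistant. " * 20,
    "history": "user/assistant turns... " * 200,
    "retrieved_docs": "RAG chunks... " * 500,
    "user_message": "Summarize the design doc.",
}
print(fits_in_window(request, window=200_000))
```

In production you would run this check before every request and trigger a truncation or summarization strategy when it fails, rather than letting the API reject the call.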
The 'lost in the middle' problem, documented by Stanford researchers in 2023 and still observed in 2026, shows that LLMs attend best to information at the beginning and end of a long context and often miss information in the middle. A 200K-token context may yield high-accuracy recall over only ~50K of it in practice. Newer architectures (Claude 4.5, Gemini 2.5) have partly mitigated this, but 'needle-in-a-haystack' benchmarks still show degradation at long contexts for complex retrieval tasks.

Cost is the other major consideration. Most providers charge per input token, so a 1M-token query costs 500x a 2K-token query. Even at $1-5 per million tokens (2026 pricing), filling a 1M-token window on every query adds up fast. This is why RAG, which retrieves only the relevant 5-20K tokens per query rather than the full corpus, remains essential even when 1M+ context is available. The right frame: long context is for prototyping and ad-hoc queries; RAG is for production systems at scale.

Caching matters too. Anthropic, OpenAI, and Google all offer prompt caching: large static prefixes (system prompts, document corpora) can be cached server-side and reused across requests at a fraction of the cost and latency. This changes the economics of long-context deployment; a 100K-token cached prefix costs about 10% as much as sending it fresh each time. Tycoon uses prompt caching heavily so AI employees can retain rich project context without paying full token costs on every turn.
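The full-context-versus-RAG cost gap above is simple per-token arithmetic. A minimal sketch, assuming an illustrative $3 per million input tokens (a made-up rate for the example, not any provider's quoted pricing):

```python
PRICE_PER_M_INPUT = 3.00  # assumed $ per 1M input tokens (illustrative)

def input_cost(tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Dollar cost of sending `tokens` input tokens at a flat per-token rate."""
    return tokens / 1_000_000 * price_per_m

full_context = input_cost(1_000_000)  # stuffing the whole corpus every query
rag_context = input_cost(15_000)      # retrieving ~5-20K relevant tokens instead
print(f"full: ${full_context:.2f}/query, rag: ${rag_context:.4f}/query")
```

At these assumed rates the full-window query costs roughly 67x the RAG query, and the gap compounds linearly with query volume, which is the economic argument for RAG in production.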

Examples

  • Anthropic Claude 4.5 Sonnet — 200K token context window; handles a full codebase or book-length document in a single request
  • Anthropic Claude Opus 4.5 — 200K context; prompt caching cuts cached-prefix costs by 90%
  • OpenAI GPT-5 — 400K token context window in the Responses API
  • Google Gemini 2.5 Pro — 1M token context window generally available, 2M in preview; one of the largest production context windows in 2026
  • Magic.dev LTM-2-Mini — research model with 100M token context; demonstrates direction of the frontier
  • Meta Llama 3.3 70B — 128K context, best-in-class for open-source models in 2026
  • Tycoon AI employees — cached system prompts of ~20-50K tokens per role, plus RAG-retrieved project memory, fitting well within Claude 4.5's 200K window

Frequently asked questions

How many tokens is an English word?

Roughly 1.3 tokens per word in English — or equivalently, ~0.75 words per token. So a 1000-token document is ~750 words, a 100K-token context is ~75K words (a 300-page book), and a 1M-token context is ~750K words (roughly the Lord of the Rings trilogy). Other languages have different ratios — Chinese and Japanese are often 1-2 characters per token, Korean 2-3, code often 2-3 tokens per line. Tokenizer differences (BPE vs. SentencePiece) also affect the ratio by 10-20%.
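The conversions above can be captured in two small helpers. The 1.3 tokens-per-word ratio is the English-prose rule of thumb from the answer, not a tokenizer-exact figure, and it shifts by language and tokenizer:

```python
TOKENS_PER_WORD = 1.3  # rough English-prose ratio; varies by tokenizer

def words_to_tokens(words: int) -> int:
    """Estimate token count from an English word count."""
    return round(words * TOKENS_PER_WORD)

def tokens_to_words(tokens: int) -> int:
    """Estimate English word count from a token count."""
    return round(tokens / TOKENS_PER_WORD)

print(tokens_to_words(100_000))  # roughly 77K words
```

For anything cost- or limit-sensitive, count with the model's actual tokenizer instead of a ratio; these helpers are only for quick capacity planning.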

If the context window is 1M tokens, should I always use it?

No, for three reasons. (1) Cost — a 1M-token query costs 500x a 2K-token query. (2) Latency — processing 1M tokens takes 10-30 seconds versus sub-second for short queries. (3) Quality — the 'lost in the middle' problem means models don't attend equally to all parts of a long context. For production systems, use RAG to retrieve the 5-20K relevant tokens instead of stuffing everything in. Reserve long context for ad-hoc exploration, prototyping, and tasks where the entire document genuinely needs to be reasoned over as a whole (e.g., full codebase refactors).

What happens when I exceed the context window?

The API returns an error. There's no silent truncation by default — you have to explicitly manage context when approaching the limit. Common strategies: sliding window (drop oldest messages), summarization (replace old conversation with a summary), RAG (retrieve only relevant history), or hierarchical memory (keep recent messages verbatim, summarize older ones). All modern agent frameworks implement one or more of these; Tycoon uses a hybrid of RAG-based project memory plus recent conversation history verbatim.
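A minimal sketch of the sliding-window strategy listed above, assuming a ~4-characters-per-token estimate and plain-string messages (real implementations work with role-tagged message objects and exact tokenizer counts):

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def truncate_history(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt pinned, then retain the most recent
    messages that fit under `budget` estimated tokens."""
    remaining = budget - estimate_tokens(system)
    kept: list[str] = []
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg)
        if cost > remaining:
            break                   # oldest remaining messages are dropped
        kept.append(msg)
        remaining -= cost
    return list(reversed(kept))     # restore chronological order
```

The summarization and hierarchical-memory strategies follow the same shape; instead of dropping the oldest messages outright, they replace them with a compact summary before the budget check.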

Does a longer context window make the model smarter?

Not inherently — it lets the model see more at once but doesn't improve reasoning ability. A 1M-token Gemini 2.5 Pro reasoning over a small problem isn't smarter than a 200K-token Claude 4.5 Sonnet reasoning over the same problem. What long context enables is tasks that previously required RAG or manual chunking — reading whole documents, comparing across many files, maintaining very long conversations. For short-context tasks, the practical differences between models are about reasoning capability and speed, not window size.

What is prompt caching and how does it interact with context windows?

Prompt caching lets you designate a prefix (system prompt, large document, tool definitions) as cacheable — the provider keeps it in memory for a few minutes, and subsequent requests reusing that prefix pay 10-25% of the normal input cost for those tokens and skip most of the processing latency. This changes long-context economics dramatically: a 100K-token document that costs $0.30 fresh costs $0.03 cached, so you can afford to keep rich context always loaded. Anthropic, OpenAI, and Google all support it; Tycoon relies on it heavily to keep AI employees' system prompts and project memory always available without burning tokens on every turn.
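The economics described above reduce to simple arithmetic. The $3-per-million price and 10% cached-read rate below are illustrative assumptions consistent with the $0.30-versus-$0.03 example, not quoted provider pricing:

```python
PRICE_PER_M = 3.00      # assumed $ per 1M input tokens, uncached (illustrative)
CACHED_FRACTION = 0.10  # assumed cached-read rate relative to fresh input

def turn_cost(prefix_tokens: int, fresh_tokens: int, cached: bool) -> float:
    """Per-turn input cost: a large reusable prefix plus per-turn fresh tokens."""
    prefix_rate = PRICE_PER_M * (CACHED_FRACTION if cached else 1.0)
    return (prefix_tokens / 1e6) * prefix_rate + (fresh_tokens / 1e6) * PRICE_PER_M

uncached = turn_cost(100_000, 2_000, cached=False)
cached = turn_cost(100_000, 2_000, cached=True)
print(f"${uncached:.3f} vs ${cached:.3f} per turn")
```

Under these assumptions the cached turn costs about 12% of the uncached one, which is why a large always-loaded prefix only becomes affordable once caching is in play.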

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds