What is a Context Window?
The LLM's working memory — how much it can see at once.
A context window is the maximum number of tokens an LLM can process in a single inference request — including the system prompt, conversation history, retrieved documents, tool outputs, and the generated response. It is the hard ceiling on how much information the model can 'see' at once, and ranges from 8K tokens (GPT-3.5 era) to 1M+ tokens (Gemini 2.5 Pro, 2025).
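The definition above can be sketched as a simple token-budget check: every component of a request, plus room for the response, must fit inside the window. This is a minimal illustration with made-up token counts, not any provider's actual accounting.

```python
# Hypothetical token-budget check: every component of a request must fit
# inside the model's context window, including room for the response.
CONTEXT_WINDOW = 200_000  # e.g. a 200K-token model

def fits_in_window(system_prompt_tokens: int,
                   history_tokens: int,
                   retrieved_tokens: int,
                   max_response_tokens: int,
                   window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the full request plus response budget fits."""
    total = (system_prompt_tokens + history_tokens
             + retrieved_tokens + max_response_tokens)
    return total <= window

# A 30K system prompt, 120K of history, 40K of retrieved docs, and a
# 4K response budget total 194K tokens, inside a 200K window.
print(fits_in_window(30_000, 120_000, 40_000, 4_000))  # True
```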
In depth
Examples
- Anthropic Claude 4.5 Sonnet — 200K token context window; handles a full codebase or book-length document in a single request
- Anthropic Claude Opus 4.5 — 200K context; prompt caching cuts cached-prefix costs by 90%
- OpenAI GPT-5 — 400K token context window in the Responses API
- Google Gemini 2.5 Pro — 1M token context window generally available, 2M in preview; one of the largest production context windows in 2026
- Magic.dev LTM-2-Mini — research model with 100M token context; demonstrates the direction of the frontier
- Meta Llama 3.3 70B — 128K context, best-in-class for open-source models in 2026
- Tycoon AI employees — cached system prompts of ~20-50K tokens per role, plus RAG-retrieved project memory, fitting well within Claude 4.5's 200K window
Frequently asked questions
How many tokens is an English word?
Roughly 1.3 tokens per word in English — or equivalently, ~0.75 words per token. So a 1000-token document is ~750 words, a 100K-token context is ~75K words (a 300-page book), and a 1M-token context is ~750K words (roughly the Lord of the Rings trilogy). Other languages have different ratios — Chinese and Japanese are often 1-2 characters per token, Korean 2-3, code often 2-3 tokens per line. Tokenizer differences (BPE vs. SentencePiece) also affect the ratio by 10-20%.
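The conversions above can be wrapped in a couple of helper functions. This uses the ~1.3 tokens-per-word heuristic from the answer; real counts vary by tokenizer and text, so treat the results as ballpark estimates only.

```python
# Rough token/word conversions for English text, using the ~1.3
# tokens-per-word heuristic. Actual counts depend on the tokenizer.
TOKENS_PER_WORD = 1.3

def words_to_tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

def tokens_to_words(tokens: int) -> int:
    return round(tokens / TOKENS_PER_WORD)

print(tokens_to_words(1_000))    # 769 -- a ~750-word document
print(tokens_to_words(100_000))  # 76923 -- roughly a 300-page book
```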
If the context window is 1M tokens, should I always use it?
No, for three reasons. (1) Cost — a 1M-token query costs 500x a 2K-token query. (2) Latency — processing 1M tokens takes 10-30 seconds versus sub-second for short queries. (3) Quality — the 'lost in the middle' problem means models don't attend equally to all parts of a long context. For production systems, use RAG to retrieve the 5-20K relevant tokens instead of stuffing everything in. Reserve long context for ad-hoc exploration, prototyping, and tasks where the entire document genuinely needs to be reasoned over as a whole (e.g., full codebase refactors).
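The cost argument is simple linear arithmetic: input cost scales with prompt size, so a 1M-token query costs 500x a 2K-token one at any per-token rate. The $3-per-million-token rate below is purely illustrative, not any provider's actual pricing.

```python
# Back-of-the-envelope input-cost comparison for long vs. short prompts.
RATE_PER_MTOK = 3.00  # USD per 1M input tokens (illustrative only)

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * RATE_PER_MTOK

short = input_cost(2_000)      # $0.006
long_ = input_cost(1_000_000)  # $3.00
print(round(long_ / short))    # 500
```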
What happens when I exceed the context window?
The API returns an error. There's no silent truncation by default — you have to explicitly manage context when approaching the limit. Common strategies: sliding window (drop oldest messages), summarization (replace old conversation with a summary), RAG (retrieve only relevant history), or hierarchical memory (keep recent messages verbatim, summarize older ones). All modern agent frameworks implement one or more of these; Tycoon uses a hybrid of RAG-based project memory plus recent conversation history verbatim.
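The simplest of those strategies, the sliding window, can be sketched in a few lines: when history exceeds a token budget, drop the oldest messages first. This is a minimal illustration; a real system would count tokens with the provider's tokenizer rather than per-message estimates.

```python
# Minimal sliding-window sketch: keep the newest messages that fit
# within a token budget, dropping the oldest first.

def trim_history(messages, budget):
    """Keep the most recent messages whose token counts fit in `budget`.

    `messages` is a list of (text, token_count) pairs, oldest first.
    """
    kept, used = [], 0
    for text, tokens in reversed(messages):  # walk newest to oldest
        if used + tokens > budget:
            break
        kept.append((text, tokens))
        used += tokens
    return list(reversed(kept))  # restore chronological order

history = [("msg1", 500), ("msg2", 700), ("msg3", 300), ("msg4", 400)]
print(trim_history(history, 1000))  # [('msg3', 300), ('msg4', 400)]
```

A hierarchical-memory variant would summarize the dropped messages instead of discarding them, keeping a compact record of older turns.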
Does a longer context window make the model smarter?
Not inherently — it lets the model see more at once but doesn't improve reasoning ability. A 1M-token Gemini 2.5 Pro reasoning over a small problem isn't smarter than a 200K-token Claude 4.5 Sonnet reasoning over the same problem. What long context enables is tasks that previously required RAG or manual chunking — reading whole documents, comparing across many files, maintaining very long conversations. For short-context tasks, the practical differences between models are about reasoning capability and speed, not window size.
What is prompt caching and how does it interact with context windows?
Prompt caching lets you designate a prefix (system prompt, large document, tool definitions) as cacheable — the provider keeps it in memory for a few minutes, and subsequent requests reusing that prefix pay 10-25% of the normal input cost for those tokens and skip most of the processing latency. This changes long-context economics dramatically: a 100K-token document that costs $0.30 fresh costs $0.03 cached, so you can afford to keep rich context always loaded. Anthropic, OpenAI, and Google all support it; Tycoon relies on it heavily to keep AI employees' system prompts and project memory always available without burning tokens on every turn.
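The economics above reduce to one multiplier: cached prefix tokens bill at a fraction of the normal input rate. This sketch reproduces the $0.30-fresh / $0.03-cached figures from the answer using a 10% cached fraction and an illustrative $3/MTok rate; real rates vary by provider and model.

```python
# Prompt-caching economics sketch: cached prefix tokens billed at ~10%
# of the normal input rate. Rates here are illustrative, not actual
# provider pricing.
RATE_PER_MTOK = 3.00
CACHED_FRACTION = 0.10

def prefix_cost(tokens: int, cached: bool) -> float:
    rate = RATE_PER_MTOK * (CACHED_FRACTION if cached else 1.0)
    return tokens / 1_000_000 * rate

print(round(prefix_cost(100_000, cached=False), 2))  # 0.3
print(round(prefix_cost(100_000, cached=True), 2))   # 0.03
```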