LLM Context Windows
The context window is the hard limit on how many tokens an LLM can process in a single request. It counts the system prompt, conversation history, retrieved documents, tool definitions, and the model's own output — all against the same budget.
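Because everything shares one budget, a quick pre-flight check is to sum the input parts and reserve room for the reply. The following is a minimal sketch, assuming the token counts are already known (for example from your provider's tokenizer). The function name and the numbers are illustrative and do not come from any SDK.

```python
def remaining_budget(window, system, history, documents, tools, reserved_output):
    """Tokens left after all inputs and the reserved output are counted.

    A negative result means the request would overflow the context window.
    """
    used = system + history + documents + tools
    return window - used - reserved_output


# Illustrative numbers only: a 200k window dominated by retrieved documents.
left = remaining_budget(
    window=200_000,
    system=2_000,
    history=30_000,
    documents=150_000,
    tools=5_000,
    reserved_output=8_000,
)
print(left)  # 5000
```

If the result is negative, something has to be trimmed (usually history or documents) before the call is made.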
Current sizes (April 2026)
- Claude Sonnet 4.6 / Opus 4.7 — 200k standard, 1M in the [1m] tier.
- GPT-5 — 400k context, 128k max output.
- Gemini 2.5 Pro — 2M tokens.
- Most open models — 8k to 128k depending on the build.
A million tokens is roughly 750k English words or 50k lines of TypeScript, though the exact ratio depends on the tokenizer and the text.
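The word figure corresponds to a rule of thumb of about 0.75 English words per token. Here is a rough sketch of that arithmetic. The ratio is an assumption that varies by tokenizer and language, and `estimate_tokens` is a hypothetical helper, not a real tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English prose


def estimate_tokens(text: str) -> int:
    """Estimate the token count from whitespace-separated words."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def words_that_fit(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return int(tokens * WORDS_PER_TOKEN)


print(words_that_fit(1_000_000))  # 750000
```

For anything billing-critical, count with the model's actual tokenizer instead of an estimate like this.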
What you can't do with a huge window
A big number on the spec sheet does not mean the model uses all of it equally well. Three practical limits:
- Attention dilution. Retrieval accuracy degrades toward the middle of long contexts — the "lost in the middle" effect. Most models recall the start and end far better.
- Cost scales linearly. Every input token is billed on every call, so a long prompt is paid for again each turn unless the provider's prompt caching applies. A 500k-token prompt at Opus rates costs several dollars per turn.
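- Latency scales super-linearly. Time to first token is dominated by prompt processing (prefill), whose attention cost grows roughly quadratically with length, so it climbs fast past 100k. Speculative decoding does not help here, since it speeds up output generation rather than prefill.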
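The cost and latency limits can be put into numbers with a back-of-the-envelope sketch. The price below is a placeholder, not a quoted rate, and the quadratic attention model is a simplification of the super-linear growth. Real serving stacks will differ.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_input: float) -> float:
    """Input-side cost of one call; output tokens are billed separately."""
    return prompt_tokens / 1_000_000 * usd_per_million_input


def relative_attention_work(tokens: int, baseline_tokens: int) -> float:
    """Ratio of self-attention work under a simple quadratic model."""
    return (tokens / baseline_tokens) ** 2


PRICE = 10.0  # hypothetical USD per million input tokens; check your provider's price list

per_turn = prompt_cost_usd(500_000, PRICE)
print(per_turn)       # 5.0
print(per_turn * 20)  # 100.0 if the same prompt is resent for 20 turns with no caching
print(relative_attention_work(500_000, 100_000))  # 25.0
```

Under these assumptions, a prompt five times longer costs five times as much but does about 25 times the attention work, which is why latency tends to hurt before the bill does.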
When to use a large window
- Feeding a whole codebase for a refactor review.
- Long legal documents where chunking would lose cross-references.
- Multi-turn conversations with dense tool output (logs, data tables).
When NOT to
- RAG over a fixed knowledge base — retrieve the top-5 chunks and save 95% on tokens.
- Repeated queries over the same corpus — embed once, search many times.
- Simple Q&A — a 4k context with a good prompt beats a 200k context stuffed with noise.
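The "embed once, search many times" pattern can be sketched as a top-k search over an index built ahead of time. This is a minimal illustration: the toy two-dimensional vectors stand in for real embeddings, and `top_k` is a hypothetical helper rather than part of any vector database API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """index: list of (chunk_text, embedding) pairs, computed once and reused."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors for illustration; a real index would hold model embeddings.
index = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("refund exceptions", [0.9, 0.1]),
]
print(top_k([1.0, 0.05], index, k=2))  # ['refund policy', 'refund exceptions']
```

Only the returned chunks go into the prompt, so the prompt size depends on k and the chunk length rather than on the size of the corpus.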
Practical tip
Measure recall. Place a needle (a unique sentence) at 10%, 50%, and 90% depth in your prompt and ask the model to return it, repeating each depth several times so the result is a rate rather than a single pass or fail. If recall at 50% depth falls below 90%, you're past the useful window for that model and should switch to retrieval.
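One way to set up this needle test is sketched below. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply. A stand-in is used here so the sketch runs offline, and you would swap in your own API call.

```python
def build_prompt(filler_sentences, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler_sentences) * depth)
    parts = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(parts)


def recall_at_depth(ask_model, filler_sentences, needle, secret, depth, trials=10):
    """Fraction of trials in which the model's answer contains the secret."""
    hits = 0
    for _ in range(trials):
        prompt = build_prompt(filler_sentences, needle, depth)
        answer = ask_model(prompt + "\n\nWhat is the secret code mentioned above?")
        hits += secret in answer
    return hits / trials


# Stand-in for a real model call: it succeeds whenever the code is in the prompt.
def fake_model(prompt):
    return "ZX-4471" if "ZX-4471" in prompt else "unknown"


filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is ZX-4471."
for depth in (0.1, 0.5, 0.9):
    print(depth, recall_at_depth(fake_model, filler, needle, "ZX-4471", depth))
```

With a real model, use filler close to your actual workload and vary it between trials. Otherwise the measured recall may not reflect how the model behaves on your data.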