AIStackWatch

LLM Context Windows

The context window is the hard limit on how many tokens an LLM can process in a single request. It counts the system prompt, conversation history, retrieved documents, tool definitions, and the model's own output — all against the same budget.
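
As a rough illustration of that shared budget, the sketch below adds up each component and reports how much room is left. The class name, field names, and token counts are made up for the example; real counts would come from the provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class RequestBudget:
    """Token counts for one request (all values are illustrative)."""
    system_prompt: int
    history: int
    retrieved_docs: int
    tool_definitions: int
    max_output: int  # tokens reserved for the model's reply

    def total(self) -> int:
        # Everything, including the reserved output, counts against the window.
        return (self.system_prompt + self.history + self.retrieved_docs
                + self.tool_definitions + self.max_output)

    def headroom(self, context_window: int) -> int:
        """Tokens left over; a negative value means the request will not fit."""
        return context_window - self.total()


budget = RequestBudget(
    system_prompt=2_000,
    history=150_000,
    retrieved_docs=40_000,
    tool_definitions=3_000,
    max_output=8_000,
)
print(budget.total())            # sum of all five components
print(budget.headroom(200_000))  # negative here, so something must be trimmed
```

Reserving the output tokens up front is the point of the sketch: a prompt that fits on its own can still overflow once the reply is counted.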

Current sizes (April 2026)

  • Claude Sonnet 4.6 / Opus 4.7 — 200k standard, 1M in the [1m] tier.
  • GPT-5 — 400k context, 128k max output.
  • Gemini 2.5 Pro — 1M tokens.
  • Most open models — 8k to 128k depending on the build.

A million tokens is roughly 750k English words or 50k lines of TypeScript.
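
For quick sizing, the same ratios can be turned into a back-of-the-envelope estimator. This is only a sketch: the two constants are the rough figures quoted above (0.75 words per token, 20 tokens per line of code), and real ratios vary by tokenizer, language, and coding style.

```python
# Rough ratios taken from the figures above; they are estimates, not tokenizer output.
WORDS_PER_TOKEN = 0.75        # 750k words per 1M tokens
TOKENS_PER_CODE_LINE = 20     # 1M tokens per 50k lines


def estimate_tokens(words: int = 0, code_lines: int = 0) -> int:
    """Estimate the token count of a mix of English prose and source code."""
    prose_tokens = words / WORDS_PER_TOKEN
    code_tokens = code_lines * TOKENS_PER_CODE_LINE
    return round(prose_tokens + code_tokens)


# A 30k-word spec plus a 4k-line module, as a hypothetical workload.
print(estimate_tokens(words=30_000, code_lines=4_000))
```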

What you can't do with a huge window

A big number on the spec sheet does not mean the model uses all of it equally well. Three practical limits:

  • Attention dilution. Retrieval accuracy degrades toward the middle of long contexts — the "lost in the middle" effect. Most models recall the start and end far better.
  • Cost scales linearly. Every token in the window is billed on every call, so a long history is paid for again each turn. A 500k-token prompt at Opus rates costs several dollars per turn.
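  • Latency scales super-linearly. Time to first token is dominated by prompt processing (prefill), which grows fast past 100k. Speculative decoding does not help here, since it speeds up output generation rather than prefill.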
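
The cost and latency points can be put into numbers with a small sketch. The price of $15 per million input tokens is a placeholder for illustration, not a quoted rate, and the quadratic term is an assumption that holds for standard self-attention; real serving stacks may behave differently.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of a single call: linear in prompt size."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


def conversation_cost_usd(prompt_tokens_per_turn, usd_per_million_tokens: float) -> float:
    """Total input cost when the whole window is resent on every turn."""
    return sum(prompt_cost_usd(t, usd_per_million_tokens)
               for t in prompt_tokens_per_turn)


def relative_attention_cost(long_tokens: int, short_tokens: int) -> float:
    """How much more attention compute a longer prompt needs,
    ASSUMING cost grows with the square of prompt length."""
    return (long_tokens / short_tokens) ** 2


PLACEHOLDER_PRICE = 15.0  # USD per million input tokens, hypothetical

print(prompt_cost_usd(500_000, PLACEHOLDER_PRICE))
print(conversation_cost_usd([500_000, 510_000, 520_000], PLACEHOLDER_PRICE))
print(relative_attention_cost(500_000, 100_000))
```

Under these assumptions the bill grows in step with the prompt, while the attention work for a 500k prompt is far more than five times that of a 100k prompt.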

When to use a large window

  • Feeding a whole codebase for a refactor review.
  • Long legal documents where chunking would lose cross-references.
  • Multi-turn conversations with dense tool output (logs, data tables).

When NOT to

  • RAG over a fixed knowledge base — retrieve the top-5 chunks rather than sending the whole corpus, which can save 95% of the tokens.
  • Repeated queries over the same corpus — embed once, search many times.
  • Simple Q&A — a 4k context with a good prompt beats a 200k context stuffed with noise.
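
The "embed once, search many times" pattern from the list above can be sketched as follows. The `embed` function here is a toy stand-in (plain word counts) so the example runs without any dependencies; a real system would call an embedding model and likely a vector store. `ChunkIndex` and its methods are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ChunkIndex:
    def __init__(self, chunks):
        self.chunks = list(chunks)
        # Embed the corpus once; every later query reuses these vectors.
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def top_k(self, query: str, k: int = 5):
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


index = ChunkIndex([
    "refund policy allows returns within 30 days",
    "shipping takes five business days",
    "our office is closed on holidays",
])
print(index.top_k("how many days for a refund", k=1))
```

Only the chunks returned by `top_k` go into the prompt, so the window stays small no matter how large the corpus grows.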

Practical tip

Measure recall. Drop a needle (a unique sentence) at 10%, 50%, and 90% depth in your prompt and ask the model to return it. Repeat each depth over several trials so the recall figure is a rate, not a single pass or fail. If recall at 50% depth is below 90%, you're past the useful window for that model and should switch to retrieval.
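
A minimal harness for this test might look like the sketch below. `ask_model` is a hypothetical callable (prompt string in, answer string out) that you would wire to your own client. The needle wording, the trial count, and the 90% threshold default are assumptions for illustration.

```python
import random


def build_prompt(filler, needle, depth):
    """Place `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler))
    sentences = filler[:position] + [needle] + filler[position:]
    question = "What is the secret code? Reply with the code only."
    return " ".join(sentences) + "\n\n" + question


def recall_at_depth(ask_model, filler, depth, trials=20, seed=0):
    """Fraction of trials in which the model returns the needle's code."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        code = f"{rng.randrange(10**6):06d}"  # fresh needle each trial
        needle = f"The secret code is {code}."
        answer = ask_model(build_prompt(filler, needle, depth))
        if code in answer:
            hits += 1
    return hits / trials


def recall_profile(ask_model, filler, depths=(0.1, 0.5, 0.9), trials=20):
    """Recall rate at each depth, keyed by depth."""
    return {d: recall_at_depth(ask_model, filler, d, trials) for d in depths}


def past_useful_window(ask_model, filler, threshold=0.9, trials=20):
    """True if mid-context recall falls below the threshold."""
    return recall_at_depth(ask_model, filler, 0.5, trials) < threshold
```

To test a given prompt size, build `filler` from sentences of your real workload until it reaches the token count you care about, then call `recall_profile(ask_model, filler)`. Using real material instead of synthetic padding keeps the result representative of production prompts.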