April 30, 2026
The 40K Token Wall
Why bigger context windows are not the answer, and what engineers running production systems actually trust in 2026.
TL;DR. Language model quality is not invariant to context length. Whether you have 1M, 2M, or 10M tokens of headroom, builders who have actually evaluated production agentic systems do not trust the model past about 40K tokens, because that is where quality starts dropping. This is the opposite of what every model provider's marketing implies. Single-pass RAG is dead. The replacement is not a longer context window. It is curated context, written and read by sub-agents that know what to fetch and what to evict.
The headline number
Jeff Huber from Chroma published research on this in late 2025 and reinforced it at AI Dev SF on April 28. Their team ran production-tuned evals across multiple frontier models with controlled context lengths. Eugene Yan's Evaluating Long-Context Q&A Systems (June 2025) reaches the same conclusion from the eval-design side: long-context benchmarks consistently overstate what production-tuned systems can rely on.
The finding: even the strongest 2026 models drift past about 40K tokens of context. Not crash. Not refuse. Just get worse. Slower retrieval inside the window. Lower precision on extraction. More likely to confidently misquote earlier turns. The exact threshold varies by model and by task, but the inflection point sits between 30K and 50K consistently.
This is the inverse of the marketing. Anthropic, OpenAI, and Google all advertise million-token windows. The windows technically work. The model is technically running. But the quality of every token-level decision the model is making drops. By 100K tokens the model is meaningfully worse than at 10K. By 500K it is closer to a hallucination machine than a reasoning system.
Why this matters in 2026
Three reasons it matters now and did not matter in 2024.
Agents read context they wrote five turns ago. A monolithic chat with one user fits in 8K. An agentic loop that plans, retrieves, executes, critiques, re-plans, and verifies accumulates context every step. Twenty turns into a real agent run, you are at 80K. Fifty turns in you are past 200K. The drift Huber documented happens silently inside the loop you wrote.
Tool catalogs are huge. GitHub's official MCP server exposes 80+ tools. Every tool definition (description plus input schema) is sent to the model as context. If each definition averages about 1K tokens, a 50-tool catalog is roughly 50K tokens of overhead, billed every turn and silently degrading every turn.
The price of getting it wrong is asymmetric. A monolithic chat that drifts produces a wrong answer the user catches. An agentic loop that drifts at turn 30 takes an irreversible action against a real system. By the time you catch it, the wrong rows are in the database.
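
The accumulation arithmetic is worth running against your own loop. A minimal sketch, assuming context grows linearly at about 4K tokens per turn (the rate implied by 80K at twenty turns); `context_after`, `first_turn_over`, and both default values are illustrative names and numbers, not measurements:

```python
def context_after(turns: int, tokens_per_turn: int = 4_000, base_tokens: int = 0) -> int:
    """Total context size after `turns` turns of linear accumulation.

    `base_tokens` is fixed overhead present from turn zero, such as a
    system prompt plus tool catalog. Both defaults are assumptions.
    """
    return base_tokens + turns * tokens_per_turn


def first_turn_over(budget: int, tokens_per_turn: int = 4_000, base_tokens: int = 0) -> int:
    """First turn number at which the context exceeds `budget` tokens."""
    if base_tokens > budget:
        return 0  # over budget before the first turn runs
    # Smallest integer t with base_tokens + t * tokens_per_turn > budget.
    return (budget - base_tokens) // tokens_per_turn + 1


print(context_after(20))                              # 80000
print(context_after(50))                              # 200000
print(first_turn_over(40_000))                        # 11
print(first_turn_over(40_000, base_tokens=50_000))    # 0
```

Under these assumptions the loop crosses 40K at turn 11, and a 50K-token tool catalog puts it over budget before the agent has done anything.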
What replaced standard RAG
The phrase "RAG is dead" overstates the case. Single-pass retrieve-then-generate RAG is dead. The replacement is the broader practice the field is calling context engineering. Gartner predicted 2026 would be "the year of context" and that prediction has held.
The actual pattern that works:
| Old (single-pass RAG) | New (agentic context engineering) |
|---|---|
| One query → one retrieval | Plan → many parallel sub-queries |
| Vector search returns text chunks | Hybrid: semantic + keyword + structured |
| Stuff into prompt, generate | Sub-agents read, write, evict, replay |
| One model, one context | Bifurcated: fast retrieval, slow reasoning |
| Eviction = none | Snapshot isolation, time travel, rollback |
The phrase Chroma uses is "agentic search." The agent itself decides what to retrieve and when, makes parallel calls into different stores (vector, keyword, structured), inspects raw artifacts when text is lossy, and writes intermediate results that downstream sub-agents read. Retrieval is not a step. It is a loop primitive.
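
The parallel fan-out described above can be sketched with `asyncio.gather`. This is an illustration, not Chroma's implementation: the three store functions, the `Hit` type, and `fan_out` are hypothetical stand-ins, and deduplicating by document id is just one simple way to merge results.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    source: str
    text: str


# Hypothetical stand-ins for real stores; each returns a ranked list of hits.
async def vector_search(query: str) -> list[Hit]:
    await asyncio.sleep(0)  # placeholder for network I/O
    return [Hit("doc-1", "vector", "semantic match"), Hit("doc-2", "vector", "near match")]


async def keyword_search(query: str) -> list[Hit]:
    await asyncio.sleep(0)
    return [Hit("doc-2", "keyword", "exact term match"), Hit("doc-3", "keyword", "term match")]


async def structured_lookup(query: str) -> list[Hit]:
    await asyncio.sleep(0)
    return [Hit("doc-1", "structured", "row from a table")]


async def fan_out(sub_queries: list[str]) -> list[Hit]:
    """Run every sub-query against every store concurrently, then dedupe by doc_id."""
    stores = (vector_search, keyword_search, structured_lookup)
    tasks = [store(q) for q in sub_queries for store in stores]
    results = await asyncio.gather(*tasks)  # results come back in task order

    seen: dict[str, Hit] = {}
    for hits in results:
        for hit in hits:
            seen.setdefault(hit.doc_id, hit)  # keep the first hit per document
    return list(seen.values())


hits = asyncio.run(fan_out(["token budget", "eviction policy"]))
print([h.doc_id for h in hits])  # ['doc-1', 'doc-2', 'doc-3']
```

In a real loop the agent would choose the sub-queries and decide after each round whether to fan out again, which is what makes retrieval a loop primitive rather than a step.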
The sub-agent pattern that actually works
The reason this is not just "bigger context with extra steps" is that sub-agents encapsulate context. The parent agent never sees the 200K tokens of raw retrieved material. The sub-agent does the retrieval, reasons over it inside its own scoped context, writes a 5K-token summary, and returns that summary to the parent. The parent stays at 10K total.
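
A minimal sketch of that encapsulation, with the model calls stubbed out. Everything here is hypothetical scaffolding: `count_tokens` counts whitespace-separated words instead of using a real tokenizer, and `summarize` truncates instead of calling a model. The point is only the boundary: raw material stays inside `SubAgent.run`, and the parent receives nothing but the summary.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(documents: list[str], budget: int) -> str:
    """Placeholder for a model call: keep words until the budget is spent."""
    words = " ".join(documents).split()
    return " ".join(words[:budget])


@dataclass
class SubAgent:
    summary_budget: int = 5_000
    peak_context_tokens: int = 0  # recorded for tracing only

    def run(self, task: str, documents: list[str]) -> str:
        # The raw material lives only in this local scope.
        scoped_context = [task, *documents]
        self.peak_context_tokens = sum(count_tokens(t) for t in scoped_context)
        return summarize(documents, self.summary_budget)


@dataclass
class ParentAgent:
    context: list[str] = field(default_factory=list)

    def delegate(self, task: str, documents: list[str]) -> SubAgent:
        sub = SubAgent()
        # Only the summary crosses the boundary into the parent's context.
        self.context.append(sub.run(task, documents))
        return sub

    def context_tokens(self) -> int:
        return sum(count_tokens(t) for t in self.context)


raw_docs = ["lorem " * 20_000] * 10  # about 200K "tokens" of retrieved material
parent = ParentAgent()
sub = parent.delegate("find the refund policy", raw_docs)
print(sub.peak_context_tokens)   # 200004
print(parent.context_tokens())   # 5000
```

Returning the `SubAgent` from `delegate` keeps its trace available to the caller without letting its context leak into the parent.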
The cost geometry of this pattern beats the long-context monolith on every axis:
- Latency: parallel sub-agents finish in roughly the time of one sequential turn.
- Quality: each sub-agent operates inside the 40K window where the model is sharp.
- Cost: 5K-token summaries are an order of magnitude cheaper than 200K-token contexts.
- Auditability: each sub-agent has its own trace. Failure can be pinned to a specific retrieval, not a fuzzed-out long context.
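
The cost claim is easy to check in tokens before worrying about prices. A rough sketch, assuming the parent runs ten more turns after retrieval and ignoring prompt caching, output tokens, and per-model price differences; the function names and the turn count are illustrative:

```python
def monolith_tokens(raw_tokens: int, parent_turns: int) -> int:
    """Raw material sits in the parent context and is re-read on every turn."""
    return raw_tokens * parent_turns


def subagent_tokens(raw_tokens: int, summary_tokens: int, parent_turns: int) -> int:
    """The sub-agent reads the raw material once; the parent re-reads only the summary."""
    return raw_tokens + summary_tokens * parent_turns


turns = 10  # assumed number of parent turns after retrieval
mono = monolith_tokens(200_000, turns)
sub = subagent_tokens(200_000, 5_000, turns)
print(mono, sub, mono / sub)  # 2000000 250000 8.0
```

The gap widens with every additional parent turn, because the one-time 200K read is amortized while the monolith pays it again each time.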
Chroma shipped a 20B-parameter agent specifically for this role: Context-1, running on Cerebras at 3000 tokens/sec, $1 per million output tokens. The pitch is not that it is smarter than Opus. The pitch is that it is the right shape of model for the retrieval-and-summary loop, and at 3000 tok/sec it finishes the loop fast enough that the parent agent does not stall.
What to instrument
If you are debugging agentic loops, the things that actually matter are not the ones every observability platform shows by default. Five signals worth wiring up:
- Tokens per turn, broken down by source. System prompt, tool catalog, history, retrieved docs, current user input. The category that grows fastest is the bottleneck.
- Drift detection. Has the agent's behavior on the same input changed since context length passed 30K? 50K? 100K? Replay golden inputs at each context size and diff the outputs.
- Eviction events. When sub-agents discard context, what did they discard? Was anything load-bearing thrown away? Without this, eviction is invisible.
- Per-tool token cost. Which MCP server charges the highest token tax relative to the value it delivers? Some tools are 90% catalog overhead and 10% actual use. Cut them.
- Cache hit rate. Anthropic's prompt cache (5-minute and 1-hour TTLs) cuts the cost of reading cached prefixes by about 10x. If your hit rate is low, your prompt structure is probably wrong: the cache matches on the prefix, so dynamic content at the front and static content at the back defeats it. Put the static content first.
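
The first signal, tokens per turn by source, takes very little code to wire up. A minimal sketch, assuming you already have per-source token counts from your tokenizer or provider usage data; `TokenLedger`, its methods, and the sample numbers are hypothetical:

```python
SOURCES = ("system_prompt", "tool_catalog", "history", "retrieved_docs", "user_input")


class TokenLedger:
    """Records tokens per turn by source and reports which source grows fastest."""

    def __init__(self) -> None:
        self.turns: list[dict[str, int]] = []

    def record(self, **tokens_by_source: int) -> None:
        unknown = set(tokens_by_source) - set(SOURCES)
        if unknown:
            raise ValueError(f"unknown sources: {sorted(unknown)}")
        # Sources not mentioned for a turn count as zero.
        self.turns.append({s: tokens_by_source.get(s, 0) for s in SOURCES})

    def total(self, turn: int) -> int:
        return sum(self.turns[turn].values())

    def fastest_growing(self) -> str:
        """Source with the largest absolute growth between the first and last turn."""
        if len(self.turns) < 2:
            raise ValueError("need at least two turns")
        first, last = self.turns[0], self.turns[-1]
        return max(SOURCES, key=lambda s: last[s] - first[s])

    def turns_over(self, budget: int = 40_000) -> list[int]:
        """Indices of turns whose total context exceeded the budget."""
        return [i for i in range(len(self.turns)) if self.total(i) > budget]


ledger = TokenLedger()
ledger.record(system_prompt=1_500, tool_catalog=12_000, history=2_000,
              retrieved_docs=4_000, user_input=200)
ledger.record(system_prompt=1_500, tool_catalog=12_000, history=18_000,
              retrieved_docs=9_000, user_input=150)
print(ledger.fastest_growing())  # history
print(ledger.turns_over())       # [1]
```

The same ledger gives you the cut points for drift replay: any turn index returned by `turns_over` is a candidate for re-running golden inputs at that context size.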
The takeaway
The strongest signal that a team has actually shipped an agentic system in production is what they say when you mention million-token context windows. The teams that have shipped roll their eyes. The teams that have not are still excited about it.
Trust the 40K number. Decompose loops into sub-agents that encapsulate context. Use long context for the conversation history that needs to be there, not as a substitute for retrieval engineering. The model providers will keep advertising bigger windows. The engineers who ship will keep ignoring them.