2026-06-14

Don't trust large context windows

Experience with Large Context Windows

Strong disagreement on “dumb zone” severity and threshold.
- Some see clear degradation once context exceeds ~10–20% of the window (e.g., 100–200k in a 1M window), with more recall errors and “tainted” paths that are hard to recover from.
- Others report stable behavior well past 300–500k, even 700–900k tokens, and say they rarely think about context at all.
Many describe performance as highly task‑dependent: simple “plumbing” or localized edits tolerate long sessions better than complex, holistic refactors.
Several note that context quality and consistency may matter more than raw length; “debris,” conflicting instructions, and repeated failed attempts seem to poison later reasoning.

Context Management Strategies

Common pattern: keep top‑level conversations short and offload work into sub‑agents, workflows, or recursive calls that have their own fresh contexts.
Others periodically reset: summarize the plan or current state into markdown (AGENTS.md, DESIGN, ROADMAP, etc.), start a fresh session from that, and continue.
Some enforce hard caps (e.g., 15–20% of window, 200–400k tokens), auto‑compaction thresholds, or “no brown M&Ms” checks (e.g., fail if the model forgets a custom build command).
A minority aggressively minimize context: one conversation per feature, manual “/last”‑style compaction, or insisting on design docs/PRDs before coding.
Several argue that external documents in the repo are better “memory” than stuffing facts into the live context or proprietary memory systems, which can store wrong or stale data.

Model, Harness, and Sampling Differences

Behavior varies by model version and tooling. Some report older or cheaper models degrading earlier; newer frontier models and certain tools (e.g., agent frameworks, workflows) improve long‑run reliability.
Attention mechanisms, tokenizer granularity, and sampling strategies (e.g., “modern” samplers like min_p) are cited as potential reasons experiences diverge, but this remains anecdotal.

Cost, UX, and Vendor Incentives

One camp runs multiple agents in parallel and scores them (e.g., ELO‑style) despite high token use, arguing outcome and time savings dominate cost.
Others call this wasteful and emphasize prompt caching, incremental updates, and stricter scoping to reduce tokens.
Many still see 1M‑token windows as a major UX improvement, even if not always used safely.

Rigor, Benchmarks, and Uncertainty

Several criticize the discussion as largely anecdotal “gardening advice” and call for systematic evals that pre‑fill context before standardized tests.
Others respond that rapid model churn makes stable, up‑to‑date benchmarks hard, so practitioners lean on evolving heuristics.

Related topics