Grok 4 Fast now has 2M context window
Long context windows: feasibility vs reality
- Several comments stress that supporting a 2M window is easy to claim, but making good use of it is hard.
- Main technical limits mentioned:
  - Attention is O(N²) in sequence length, so latency and throughput degrade sharply at large contexts.
  - Training on very long sequences is prohibitively expensive, and genuinely long documents are relatively scarce in training data.
  - Many models are trained on shorter contexts and then extended with positional-encoding tricks (RoPE scaling, YaRN). This yields nominal capability but not necessarily strong long-context performance.
- Some argue vendors conflate “accepts a long prompt” with “actually uses the full context”, and may compress or drop middle tokens.
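The O(N²) point above can be made concrete with a minimal single-head attention sketch (a toy NumPy implementation for illustration, not any vendor's actual kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention; the (N, N) score matrix is the quadratic cost."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # shape (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # shape (N, d)

# The score matrix alone needs N*N floats: at float32 that is roughly
# 4 GB for N=32k and 16,000 GB for N=2M, which is why long-context
# serving leans on fused kernels, KV caching, and sparse/linear variants.
for n in (32_000, 2_000_000):
    print(f"N={n:>9,}: ~{n * n * 4 / 1e9:,.0f} GB for the float32 score matrix")
```

Even without running the full matmul at large N, the memory arithmetic alone shows why "supports 2M tokens" and "attends densely over 2M tokens" are different claims.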
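The positional-encoding tricks mentioned above can be sketched as well. This is a simplified RoPE with position interpolation: dividing positions by a scale factor squeezes a longer sequence into the rotation range seen during training (YaRN refines this with per-frequency scaling and an attention temperature; this toy version shows only the basic idea):

```python
import numpy as np

def apply_rope(x, positions, base=10000.0, scale=1.0):
    """Rotate consecutive feature pairs of x (shape (N, d)) by position-dependent angles.

    scale > 1 implements simple position interpolation: a model trained
    up to position P can be fed positions up to scale*P while the angles
    stay in the trained range.
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # theta_i = base^(-2i/d)
    angles = np.outer(positions / scale, inv_freq)   # (N, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Note that interpolation changes the encoding without adding any training on long-range dependencies, which is one mechanism behind the "capability but not performance" gap the comments describe.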
Empirical behavior of long-context models
- Users report that models often overweight the start and end of a prompt and underweight the middle.
- Benchmarks and anecdotes suggest accuracy degrades as context length grows; retrieval of a single fact is easier than reasoning over many dispersed facts.
- Others report surprisingly good results: dumping whole (smallish) codebases or hundreds of thousands of tokens of logs/manuals into Gemini or Grok and getting useful output.
- There’s debate between aggressively modularizing tasks to avoid large contexts and the view that “more context always wins” when it lets you skip elaborate RAG/preprocessing pipelines.
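The start/middle/end asymmetry reported above is exactly what needle-in-a-haystack tests probe. A minimal harness sketch is below; `ask_model` is a hypothetical stand-in for whatever API client you use, not a real function:

```python
def build_haystack(needle, filler, n_fillers, depth):
    """Bury `needle` at fractional `depth` (0.0 = start, 1.0 = end) among filler lines."""
    lines = [filler] * n_fillers
    lines.insert(int(depth * n_fillers), needle)
    return "\n".join(lines)

needle = "The vault code is 4821."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(
        needle=needle,
        filler="The sky was a uniform grey that afternoon.",
        n_fillers=2_000,
        depth=depth,
    )
    # answer = ask_model(prompt + "\n\nWhat is the vault code?")  # hypothetical call;
    # log whether "4821" appears in `answer` at each depth.
```

Sweeping `depth` is what surfaces the reported U-shaped curve: retrieval tends to succeed near 0.0 and 1.0 and fail more often around 0.5, and single-needle retrieval is itself an easier task than multi-fact reasoning.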
Grok’s quality, speed, and use cases
- Numerous users praise Grok Code/Fast as:
  - Very cheap and extremely fast; speed is seen as a major productivity boost.
  - Strong for DevOps configs, refactors, and certain data-extraction tasks; sometimes outperforming Claude, OpenAI, and Gemini on specific codebases.
  - Looser on safety filters, enabling use cases other models block (both legitimate and “NSFW” ones).
- Others find Grok unreliable or worse than Claude/GPT on complex coding and design tasks, or are annoyed by its earlier “snarky/edgy” persona.
- Integration gaps (e.g., weaker editor/CLI tooling) and recent agent/RLHF changes are reported to have hurt usability for some.
Politics, trust, and bias
- A large subthread revolves around distrust of Grok due to its owner’s politics, broken promises, and perceived manipulation of the model’s worldview.
- Some see Grok as uniquely dangerous because it is explicitly steered toward a particular ideology; incidents of prompt-level political interference are cited.
- Others argue all major LLMs are biased and commercially driven; the pragmatic stance is to use multiple vendors and “read between the lines.”
- There is also concern over whether Grok (or any provider) truly honors “no training on paid API data” and who can be trusted with sensitive code.
Context window vs “real” quality metrics
- Some commenters dismiss record context or tokens/sec as marketing metrics, arguing that overall reasoning quality matters more.
- Others counter that context length and speed are real, orthogonal dimensions on a Pareto frontier: many practical workflows (large codebases, long logs, technical manuals) directly benefit from higher context and lower latency.
- Several highlight that even with huge windows, good prompting, task decomposition, and tool use (tests, build commands, sub-agents) remain critical for non-garbage refactors.
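The Pareto-frontier framing is easy to make precise: a model is only "off the table" if some other model is at least as good on every axis and strictly better on one. A small sketch, with entirely made-up numbers (not benchmark results):

```python
def dominated(p, points):
    # q dominates p if q is at least as good on every axis
    # and strictly better on at least one (higher is better everywhere).
    return any(
        all(qi >= pi for qi, pi in zip(q, p)) and any(qi > pi for qi, pi in zip(q, p))
        for q in points
    )

def pareto_front(points):
    return [p for p in points if not dominated(p, points)]

# Hypothetical (context_tokens, tokens_per_sec, quality_score) triples:
models = {
    "long+fast":   (2_000_000, 200, 0.62),
    "short+smart": (200_000,    80, 0.71),
    "neither":     (128_000,    60, 0.55),  # dominated by "short+smart"
}
front = pareto_front(list(models.values()))
```

On this view both camps can be right: a long-context/fast model and a shorter-context/smarter model can sit on the same frontier, and which one "wins" depends on whether your workflow is context-bound or reasoning-bound.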