Building more with GPT-5.1-Codex-Max

Release Timing & Competitive Landscape

  • Many see the release as timed to counter a rival model launch, continuing a pattern of labs clustering big announcements to hijack each other’s hype.
  • Some think this implies Codex-Max is an incremental checkpoint rather than a fundamental architecture shift, though reported coding benchmarks still improve on both its predecessor and competing models.
  • There’s debate over whether any one company can “win”: rivals hold platform control (e.g. browsers, search), while OpenAI has to fight harder for distribution.

Benchmarks vs Real-World Coding

  • Commenters focus heavily on METR/SWE/TerminalBench scores, but multiple people doubt that benchmarks reflect day-to-day coding and worry about models being overfitted to evals.
  • Direct side‑by‑side trials: several users report Codex outperforming a major competitor on planning and implementation for backend/logical tasks; others strongly prefer the competitor for planning and Codex for execution.
  • Some say the new model is still weaker or slower than other top models (especially for UI/frontend), or not clearly better than earlier GPT-5.1 variants.

Long-Running Agents vs Iterative Assistance

  • Marketing around “long‑running, detailed work” clashes with users who only trust tightly-scoped, interactive tasks.
  • Codex is described as extremely literal and persistent: great for large refactors and deep adherence to instructions, but prone to absurd overreach (e.g. massive rewrites) if not carefully constrained.
  • Competing tools are seen as faster, more “heuristic” or improvisational—good for quick web/UI work but more willing to ignore instructions, mock away tests, or wander off-task.

Compaction, Context & Technical Debates

  • Codex-Max adds automatic “compaction” across long sessions; several note this is similar in spirit to prior agents and IDE summarization, but now trained into the model’s behavior.
  • Discussion dives into why context windows are hard limits (quadratic attention, memory, error accumulation) and compares sparse/linear attention approaches in other models.
  • Some welcome better long-context behavior; others mostly want short‑task quality and predictable iterative loops, not 6‑hour agents.
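The “compaction” idea above can be sketched as a simple loop policy: when a transcript nears the context limit, older turns are collapsed into a summary and only recent turns are kept verbatim. This is an illustrative sketch, not OpenAI’s implementation; `count_tokens` and `summarize` are hypothetical stand-ins (a real system would use the model’s tokenizer and a model-generated summary).

```python
# Hypothetical sketch of session "compaction" (not any vendor's actual code).
# When total tokens exceed the budget, older messages are replaced by one
# summary message so the session can keep running indefinitely.
def compact(messages, max_tokens, count_tokens, summarize, keep_recent=4):
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens:
        return messages  # still fits; nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # one message standing in for many old turns
    return [summary] + recent

# Toy usage: word counts stand in for a real tokenizer.
msgs = [f"message {i} " * 50 for i in range(20)]  # ~100 "tokens" each
compacted = compact(
    msgs,
    max_tokens=500,
    count_tokens=lambda m: len(m.split()),
    summarize=lambda old: f"[summary of {len(old)} earlier messages]",
)
print(len(compacted))  # 5: one summary + the 4 most recent messages
```

The commenters’ point that this resembles prior agents and IDE summarization is visible here: the novelty claimed for Codex-Max is that the model is trained to work across such compactions, not the compaction mechanism itself.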
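The “quadratic attention” argument in the thread can be made concrete with back-of-the-envelope arithmetic: standard attention materializes an n×n score matrix, so memory for that matrix grows with the square of the sequence length. The numbers below assume fp16 (2 bytes per score) and count a single head in a single layer; real budgets multiply this by heads and layers, which is why long contexts get expensive fast.

```python
# Why context windows are hard limits: dense attention scores scale O(n^2).
def attention_score_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes for one n x n attention score matrix (per head, per layer)."""
    return seq_len * seq_len * dtype_bytes

for n in (8_192, 131_072, 1_048_576):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:,.3f} GiB per head per layer")
# 8k tokens  ->     0.125 GiB
# 128k tokens ->   32.000 GiB
# 1M tokens  -> 2048.000 GiB
```

Sparse and linear attention variants mentioned in the discussion attack exactly this term, trading the dense n² matrix for structured approximations that scale closer to O(n).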

Tooling, Limits & Product Experience

  • Codex CLI is praised for power but criticized as slow, opaque while running, and sometimes too locked-down (sandbox issues, timeouts, rate limits).
  • Users request plan modes, finer-grained permissions, better context and subagent management, smaller/cheaper Codex variants, and access via standard chat UI.
  • Broader frustration targets all vendors’ billing, account, and privacy UX—especially confusion and mistrust around one competitor’s subscriptions, rate limits, and training-on-user-code policies.