Building more with GPT-5.1-Codex-Max
Release Timing & Competitive Landscape
- Many see the release as timed to counter a rival model launch, continuing a pattern of labs clustering big announcements to hijack each other’s hype.
- Some take this to imply Codex-Max is an incremental checkpoint rather than a fundamental architecture shift, though coding benchmarks reportedly improve over both its predecessor and competitors.
- There’s debate over whether one company can “win” given platform control (e.g. browsers, search) versus OpenAI’s need to fight harder for distribution.
Benchmarks vs Real-World Coding
- Commenters focus heavily on METR/SWE/TerminalBench scores, but multiple people doubt that benchmarks reflect day-to-day coding and worry about models being overfitted to evals.
- Direct side‑by‑side trials: several users report Codex outperforming a major competitor on planning and implementation for backend/logical tasks; others strongly prefer the competitor for planning and Codex for execution.
- Some say the new model is still weaker or slower than other top models (especially for UI/frontend), or not clearly better than earlier GPT-5.1 variants.
Long-Running Agents vs Iterative Assistance
- Marketing around “long‑running, detailed work” clashes with users who trust the model only for tightly scoped, interactive tasks.
- Codex is described as extremely literal and persistent: great for large refactors and deep adherence to instructions, but prone to absurd overreach (e.g. massive rewrites) if not carefully constrained.
- Competing tools are seen as faster, more “heuristic” or improvisational—good for quick web/UI work but more willing to ignore instructions, mock away tests, or wander off-task.
Compaction, Context & Technical Debates
- Codex-Max adds automatic “compaction” across long sessions; several note this is similar in spirit to prior agents and IDE summarization, but now trained into the model’s behavior.
- Discussion dives into why context windows are hard limits (quadratic attention, memory, error accumulation) and compares sparse/linear attention approaches in other models.
- Some welcome better long-context behavior; others mostly want short‑task quality and predictable iterative loops, not 6‑hour agents.
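The compaction idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not OpenAI's actual mechanism: `summarize` is a stub standing in for a model call, and the token counter is a crude whitespace proxy. The point is just the shape of the loop: when session history exceeds a budget, older turns collapse into a summary so the agent can keep working.

```python
def count_tokens(text: str) -> int:
    # crude proxy: whitespace-delimited tokens, not a real tokenizer
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # stub: a real system would ask a model for a faithful summary
    return "[summary of %d earlier turns]" % len(turns)

def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """If the session exceeds the token budget, replace all but the most
    recent turns with a single summary entry."""
    total = sum(count_tokens(t) for t in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = ["turn one text", "turn two text", "turn three text", "final ask"]
compacted = compact(history, budget=5)
# older turns collapsed into one summary entry; recent turns kept verbatim
```

The "trained into the model's behavior" distinction in the thread is that the model learns to work across such compactions natively, rather than an IDE or agent harness bolting this loop on from outside.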
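The quadratic-attention point is easy to make concrete. A minimal back-of-the-envelope sketch (generic full attention, not any particular model's numbers): each head materializes an n×n score matrix, so doubling the context roughly quadruples that matrix's memory and compute.

```python
def attention_score_elements(seq_len: int) -> int:
    """Entries in one head's n x n attention score matrix."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Approximate fp16 memory for one layer's score matrices."""
    return attention_score_elements(seq_len) * n_heads * bytes_per_elem

# Doubling the context quadruples the score-matrix cost:
small = score_matrix_bytes(seq_len=128_000, n_heads=32)
large = score_matrix_bytes(seq_len=256_000, n_heads=32)
assert large == 4 * small
```

This is why the thread contrasts full attention with sparse/linear-attention variants, which trade exactness for sub-quadratic scaling, and why compaction is an alternative answer: shrink n instead of changing the attention mechanism.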
Tooling, Limits & Product Experience
- Codex CLI is praised for power but criticized as slow, opaque while running, and sometimes too locked-down (sandbox issues, timeouts, rate limits).
- Users request plan modes, finer-grained permissions, better context and subagent management, smaller/cheaper Codex variants, and access via standard chat UI.
- Broader frustration targets all vendors’ billing, account, and privacy UX—especially confusion and mistrust around one competitor’s subscriptions, rate limits, and training-on-user-code policies.