GPT-5.2-Codex
Comparisons with Gemini and Claude
- Several commenters report GPT‑5.2 (and 5.2‑Codex) outperforming Gemini 3 Pro/Flash and Claude Opus 4.5 for “serious” coding, especially as an agent in tools like Cursor.
- Counterpoints note benchmarks where Anthropic and OpenAI are very close, or Anthropic slightly ahead, and that Gemini 3 Flash sometimes beats Pro on coding benchmarks.
- Many say Gemini 3 Pro is strong as a tutor/math/general model but weak as a coding agent and at tool calling (e.g., breaking demos, deleting blocks of code, inserting placeholders).
- Others find Claude stronger for fast implementation and lightweight solutions, with GPT models better for “enterprise-style” code and thoroughness.
- Some users say Codex models are consistently worse than base GPT‑5.x for code quality, producing functional but “weird/ugly” or over‑abstracted code.
Agentic harnesses and UX
- Strongly held view that the harness/tooling (Claude Code, Codex CLI, Cursor, Gemini CLI, etc.) matters as much as the underlying model.
- Claude Code is praised for planning mode, human‑in‑the‑loop flow, sub‑agents, clear terminal UX, and prompting that keeps edits under control.
- Codex is seen as powerful but often over‑eager: it starts editing when users only want discussion and can be frustrating without a planning layer.
- Some run their own multi‑model TUIs or containers, fanning the same task out to multiple agents and comparing the resulting diffs (a rough sketch of this fan‑out follows this list).
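To make the fan‑out idea concrete, here is a rough Python sketch of one possible setup, not any commenter's actual tooling: each agent gets its own git worktree, the CLI invocations are placeholders for whatever your Claude Code / Codex CLI / Gemini CLI versions accept, and the resulting diffs are printed for side‑by‑side comparison.

```python
#!/usr/bin/env python3
"""Fan one coding task out to several CLI agents and compare the diffs.

Rough sketch only: the agent commands are placeholders; swap in whatever
invocations your Claude Code / Codex CLI / Gemini CLI versions actually accept.
Each agent works in its own git worktree so edits never collide, and the
diff step assumes the agent leaves its changes uncommitted.
"""
import subprocess
from pathlib import Path

REPO = Path(".").resolve()
TASK = "Fix the flaky retry logic in net/client.py"  # example task

# Placeholder commands; flags differ per tool and version.
AGENTS = {
    "claude": ["claude", "-p", TASK],
    "codex": ["codex", "exec", TASK],
    "gemini": ["gemini", "-p", TASK],
}


def run_agent(name: str, cmd: list[str]) -> str:
    worktree = REPO.parent / f"{REPO.name}-{name}"
    # One throwaway branch and worktree per agent.
    subprocess.run(
        ["git", "worktree", "add", "-B", f"agent/{name}", str(worktree)],
        cwd=REPO, check=True,
    )
    subprocess.run(cmd, cwd=worktree, check=False)  # let the agent edit files
    return subprocess.run(
        ["git", "diff"], cwd=worktree, capture_output=True, text=True,
    ).stdout


if __name__ == "__main__":
    for name, cmd in AGENTS.items():
        diff = run_agent(name, cmd)
        print(f"===== {name}: {len(diff.splitlines())} diff lines =====")
        print(diff or "(no changes)")
```

Commenters describe wrapping this kind of loop in a TUI or running each agent in its own container; the worktree approach above is just one way to keep the edits isolated enough to diff.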
Cybersecurity capabilities and dual‑use
- “Dual‑use” is interpreted as: anything that helps defenders find/understand vulnerabilities also helps attackers automate exploitation and scale attacks.
- Comments note this is more about lowering the barrier and increasing speed/scale than inventing fundamentally new attack classes.
- OpenAI’s invite‑only, more‑permissive “defensive” models are seen by some as reasonable vetting, by others as gatekeeping that may hinder white‑hat work.
- Experiences with guardrails are mixed: some say GPT refuses offensive help, others report using it daily for offensive tasks without issues, possibly due to accumulated “security” context.
Workflows, quality vs speed
- Many describe hybrid workflows: plan/architect with one model, implement with another, and use a third (often Codex 5.2) purely as a reviewer/bug‑hunter.
- GPT‑5.2/Codex is frequently praised for deep, methodical reasoning, finding subtle logic and memory bugs, especially in lower‑level or complex systems.
- Claude/Opus is preferred where speed and token‑efficiency matter, with users accepting more “fluff” or missed issues.
- A recurring pattern: use slower, high‑reasoning models for planning and review, and faster ones for bulk coding (sketched below, after this list).
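As an illustration of that plan/implement/review split, here is a minimal sketch assuming all three roles are reachable through a single OpenAI‑compatible chat endpoint (for example via a router); the model ids and prompts are placeholders rather than anyone's actual configuration.

```python
"""Plan -> implement -> review pipeline across three models.

Minimal sketch: assumes every model is reachable through one
OpenAI-compatible chat endpoint (e.g. via a router); the model ids
below are placeholders, not exact product names.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and optionally a custom base_url)

PLANNER = "high-reasoning-model"   # slow, thorough: writes the plan
IMPLEMENTER = "fast-coding-model"  # cheap and quick: does the bulk coding
REVIEWER = "deep-review-model"     # slow again: hunts for subtle bugs


def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content


def hybrid(task: str) -> str:
    plan = ask(PLANNER,
               "You are a software architect. Produce a numbered implementation plan only.",
               task)
    code = ask(IMPLEMENTER,
               "Implement the plan faithfully. Return code only.",
               f"Task: {task}\n\nPlan:\n{plan}")
    review = ask(REVIEWER,
                 "Review the code for subtle logic, concurrency, and memory bugs. List concrete issues.",
                 f"Task: {task}\n\nCode:\n{code}")
    return review


if __name__ == "__main__":
    print(hybrid("Add exponential backoff with jitter to the HTTP client retry loop."))
```

The division of labor mirrors the comments above: slower, high‑reasoning models bookend the loop for planning and review, while a faster model produces the bulk of the code.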
Reliability issues and risks
- Reports of serious agentic failures: replacing large code sections with placeholders, misusing tools (e.g., destructive shell commands, breaking SELinux, deleting repos or project directories in “yolo” mode).
- Some users report cancelling subscriptions after repeated overfitting or “target fixation” (e.g., the model forcing the wrong CRDT algorithm despite explicit instructions).
- Codex Cloud’s inability to truly delete tasks/diffs (only “archive”) is viewed as a privacy/security concern; local/CLI sessions are distinguished from cloud storage.
Pricing, quotas, and business context
- Users note GPT‑5.2‑Codex is substantially more expensive than the previous Codex, but subscriptions hide much of that and feel generous compared to some competitors.
- Debate over whether inference is currently profitable vs being subsidized for growth; some cite massive long‑term compute commitments and question sustainability.
- Several commenters consciously pick models per price tier: e.g., Opus/Claude Code for primary work, Codex for specialized review, or vice versa.
Shifting attitudes and skepticism
- Many long‑time skeptics say they changed their minds as models improved and now find it hard to justify not using coding agents.
- Others remain strongly skeptical, citing repeated failures on non‑toy tasks and warning about overestimating productivity gains due to psychological bias.
- There are accusations of “astroturf” enthusiasm around each LLM release, countered by reminders that some developers simply see large, real productivity improvements.