GPT-5.2-Codex

Comparisons with Gemini and Claude

  • Several commenters report GPT‑5.2 (and 5.2‑Codex) outperforming Gemini 3 Pro/Flash and Claude Opus 4.5 for “serious” coding, especially as an agent in tools like Cursor.
  • Counterpoints note benchmarks where Anthropic and OpenAI are very close, or Anthropic slightly ahead, and that Gemini 3 Flash sometimes beats Pro on coding benchmarks.
  • Many say Gemini 3 Pro is strong as a tutor/math/general model but weak as a coding agent and at tool calling (e.g., breaking demos, deleting blocks of code, inserting placeholders).
  • Others find Claude stronger for fast implementation and lightweight solutions, with GPT models better for “enterprise-style” code and thoroughness.
  • Some users say Codex models are consistently worse than base GPT‑5.x for code quality, producing functional but “weird/ugly” or over‑abstracted code.

Agentic harnesses and UX

  • Strong view that harness/tooling (Claude Code, Codex CLI, Cursor, Gemini CLI, etc.) matter as much as the underlying model.
  • Claude Code is praised for planning mode, human‑in‑the‑loop flow, sub‑agents, clear terminal UX, and prompting that keeps edits under control.
  • Codex is seen as powerful but often over‑eager: starts editing when users only want discussion, can be frustrating without a planning layer.
  • Some run their own multi‑model TUIs or containers, fanning the same task to multiple agents and comparing diffs.
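
The fan‑out pattern can be sketched in miniature. The two "agents" below are trivial stand‑ins for real CLI agents (the actual invocations would shell out to tools like Codex CLI or Claude Code); `difflib` handles the diff comparison:

```python
# Sketch: fan the same task out to several agents concurrently, then
# compare each agent's proposed edit as a unified diff against the base.
# The agent functions are stubs; a real harness would invoke CLI agents.
import difflib
from concurrent.futures import ThreadPoolExecutor

BASE = ["def greet(name):\n", "    return 'hi ' + name\n"]

def agent_a(src):
    # Stand-in: this "agent" rewrites the return as an f-string.
    return [src[0], "    return f'hi {name}'\n"]

def agent_b(src):
    # Stand-in: this "agent" leaves the file unchanged.
    return list(src)

def fan_out(src, agents):
    """Run every agent on the same source; return {name: unified diff}."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, src) for name, fn in agents.items()}
    return {
        name: "".join(difflib.unified_diff(src, fut.result(),
                                           fromfile="base", tofile=name))
        for name, fut in futures.items()
    }

diffs = fan_out(BASE, {"agent_a": agent_a, "agent_b": agent_b})
```

An unchanged attempt produces an empty diff, so identical results are easy to spot before any human review.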

Cybersecurity capabilities and dual‑use

  • “Dual‑use” is interpreted as: anything that helps defenders find/understand vulnerabilities also helps attackers automate exploitation and scale attacks.
  • Comments note this is more about lowering the barrier and increasing speed/scale than inventing fundamentally new attack classes.
  • OpenAI’s invite‑only, more‑permissive “defensive” models are seen by some as reasonable vetting, by others as gatekeeping that may hinder white‑hat work.
  • Experiences with guardrails are mixed: some say GPT refuses offensive help, others report using it daily for offensive tasks without issues, possibly due to accumulated “security” context.

Workflows: quality vs. speed

  • Many describe hybrid workflows: plan/architect with one model, implement with another, and use a third (often Codex 5.2) purely as a reviewer/bug‑hunter.
  • GPT‑5.2/Codex is frequently praised for deep, methodical reasoning, finding subtle logic and memory bugs, especially in lower‑level or complex systems.
  • Claude/Opus is preferred where speed and token‑efficiency matter, with users accepting more “fluff” or missed issues.
  • A recurring pattern: use slower, high‑reasoning models for planning and review; faster ones for bulk coding.
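
That plan/implement/review split amounts to a role‑to‑model routing table driven in sequence. A minimal sketch, where the model labels are illustrative (not pinned API identifiers) and `call_model` is a stub standing in for a real API or CLI invocation:

```python
# Sketch of a hybrid workflow: plan with one model, implement with
# another, review with a third. Model names are illustrative labels.

ROLE_MODELS = {
    "plan": "gpt-5.2",           # slower, high-reasoning: architecture
    "implement": "claude-opus",  # faster, token-efficient: bulk coding
    "review": "gpt-5.2-codex",   # methodical: bug-hunting pass
}

def call_model(model: str, prompt: str) -> str:
    """Stub: a real version would hit the model's API or CLI."""
    return f"[{model}] {prompt}"

def run_task(task: str) -> dict:
    """Drive one task through the three roles in order."""
    plan = call_model(ROLE_MODELS["plan"], f"Plan: {task}")
    code = call_model(ROLE_MODELS["implement"], f"Implement per plan: {plan}")
    review = call_model(ROLE_MODELS["review"], f"Review for bugs: {code}")
    return {"plan": plan, "code": code, "review": review}

result = run_task("add retry logic to the HTTP client")
```

Swapping a model in or out is then a one‑line change to the routing table rather than a rewrite of the workflow.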

Reliability issues and risks

  • Reports of serious agentic failures: deleting large code sections with placeholders, misusing tools (e.g., destructive shell commands, breaking SELinux, deleting repos or project directories in “yolo” mode).
  • Some users cancel subscriptions after repeated overfitting or “target fixation” (e.g., forcing the wrong CRDT algorithm despite explicit instructions).
  • Codex Cloud’s inability to truly delete tasks/diffs (only “archive”) is viewed as a privacy/security concern; local/CLI sessions are distinguished from cloud storage.

Pricing, quotas, and business context

  • Users note GPT‑5.2‑Codex is substantially more expensive than the previous Codex, but subscriptions hide much of that and feel generous compared to some competitors.
  • Debate over whether inference is currently profitable vs being subsidized for growth; some cite massive long‑term compute commitments and question sustainability.
  • Several commenters consciously pick models per price tier: e.g., Opus/Claude Code for primary work, Codex for specialized review, or vice versa.

Shifting attitudes and skepticism

  • Many long‑time skeptics say they have changed their minds as models improved, and now find it hard to justify not using coding agents.
  • Others remain strongly skeptical, citing repeated failures on non‑toy tasks and warning about overestimating productivity gains due to psychological bias.
  • There are accusations of “astroturf” enthusiasm around each LLM release, countered by reminders that some developers simply see large, real productivity improvements.