GPT-5.2-Codex
Comparisons with Gemini and Claude
- Several commenters report GPT‑5.2 (and 5.2‑Codex) outperforming Gemini 3 Pro/Flash and Claude Opus 4.5 for “serious” coding, especially as an agent in tools like Cursor.
- Counterpoints note benchmarks where Anthropic and OpenAI are very close, or Anthropic slightly ahead, and that Gemini 3 Flash sometimes beats Pro on coding benchmarks.
- Many say Gemini 3 Pro is strong as a tutor/math/general model but weak as a coding agent and at tool calling (e.g., breaking demos, deleting blocks of code, inserting placeholders).
- Others find Claude stronger for fast implementation and lightweight solutions, with GPT models better for “enterprise-style” code and thoroughness.
- Some users say Codex models are consistently worse than base GPT‑5.x for code quality, producing functional but “weird/ugly” or over‑abstracted code.
Agentic harnesses and UX
- Strongly held view that the harness/tooling (Claude Code, Codex CLI, Cursor, Gemini CLI, etc.) matters as much as the underlying model.
- Claude Code is praised for planning mode, human‑in‑the‑loop flow, sub‑agents, clear terminal UX, and prompting that keeps edits under control.
- Codex is seen as powerful but often over‑eager: it starts editing when users only want discussion and can be frustrating without a planning layer.
- Some run their own multi‑model TUIs or containers, fanning the same task out to multiple agents and comparing the resulting diffs (a rough sketch of this fan‑out follows this list).
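To make the fan‑out idea concrete, here is a rough Python sketch of one possible setup, not any commenter's actual tooling: each agent gets its own git worktree, the CLI invocations are placeholders for whatever your Claude Code / Codex CLI / Gemini CLI versions accept, and the resulting diffs are printed for side‑by‑side comparison.

```python
#!/usr/bin/env python3
"""Fan one coding task out to several CLI agents and compare the diffs.

Rough sketch only: the agent commands are placeholders; swap in whatever
invocations your Claude Code / Codex CLI / Gemini CLI versions actually accept.
Each agent works in its own git worktree so edits never collide, and the
diff step assumes the agent leaves its changes uncommitted.
"""
import subprocess
from pathlib import Path

REPO = Path(".").resolve()
TASK = "Fix the flaky retry logic in net/client.py"  # example task

# Placeholder commands; flags differ per tool and version.
AGENTS = {
    "claude": ["claude", "-p", TASK],
    "codex": ["codex", "exec", TASK],
    "gemini": ["gemini", "-p", TASK],
}


def run_agent(name: str, cmd: list[str]) -> str:
    worktree = REPO.parent / f"{REPO.name}-{name}"
    # One throwaway branch and worktree per agent.
    subprocess.run(
        ["git", "worktree", "add", "-B", f"agent/{name}", str(worktree)],
        cwd=REPO, check=True,
    )
    subprocess.run(cmd, cwd=worktree, check=False)  # let the agent edit files
    return subprocess.run(
        ["git", "diff"], cwd=worktree, capture_output=True, text=True,
    ).stdout


if __name__ == "__main__":
    for name, cmd in AGENTS.items():
        diff = run_agent(name, cmd)
        print(f"===== {name}: {len(diff.splitlines())} diff lines =====")
        print(diff or "(no changes)")
```

Commenters describe wrapping this kind of loop in a TUI or running each agent in its own container; the worktree approach above is just one way to keep the edits isolated enough to diff.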
Cybersecurity capabilities and dual‑use
- “Dual‑use” is interpreted as: anything that helps defenders find/understand vulnerabilities also helps attackers automate exploitation and scale attacks.
- Comments note this is more about lowering the barrier and increasing speed/scale than inventing fundamentally new attack classes.
- OpenAI’s invite‑only, more‑permissive “defensive” models are seen by some as reasonable vetting, by others as gatekeeping that may hinder white‑hat work.
- Experiences with guardrails are mixed: some say GPT refuses offensive help, others report using it daily for offensive tasks without issues, possibly due to accumulated “security” context.
Workflows, quality vs speed
- Many describe hybrid workflows: plan/architect with one model, implement with another, and use a third (often Codex 5.2) purely as a reviewer/bug‑hunter.
- GPT‑5.2/Codex is frequently praised for deep, methodical reasoning, finding subtle logic and memory bugs, especially in lower‑level or complex systems.
- Claude/Opus is preferred where speed and token‑efficiency matter, with users accepting more “fluff” or missed issues.
- A recurring pattern: use slower, high‑reasoning models for planning and review, and faster ones for bulk coding (sketched below, after this list).
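As an illustration of that plan/implement/review split, here is a minimal sketch assuming all three roles are reachable through a single OpenAI‑compatible chat endpoint (for example via a router); the model ids and prompts are placeholders rather than anyone's actual configuration.

```python
"""Plan -> implement -> review pipeline across three models.

Minimal sketch: assumes every model is reachable through one
OpenAI-compatible chat endpoint (e.g. via a router); the model ids
below are placeholders, not exact product names.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and optionally a custom base_url)

PLANNER = "high-reasoning-model"   # slow, thorough: writes the plan
IMPLEMENTER = "fast-coding-model"  # cheap and quick: does the bulk coding
REVIEWER = "deep-review-model"     # slow again: hunts for subtle bugs


def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content


def hybrid(task: str) -> str:
    plan = ask(PLANNER,
               "You are a software architect. Produce a numbered implementation plan only.",
               task)
    code = ask(IMPLEMENTER,
               "Implement the plan faithfully. Return code only.",
               f"Task: {task}\n\nPlan:\n{plan}")
    review = ask(REVIEWER,
                 "Review the code for subtle logic, concurrency, and memory bugs. List concrete issues.",
                 f"Task: {task}\n\nCode:\n{code}")
    return review


if __name__ == "__main__":
    print(hybrid("Add exponential backoff with jitter to the HTTP client retry loop."))
```

The division of labor mirrors the comments above: slower, high‑reasoning models bookend the loop for planning and review, while a faster model produces the bulk of the code.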
Reliability issues and risks
- Reports of serious agentic failures: replacing large code sections with placeholders, misusing tools (e.g., destructive shell commands, breaking SELinux, deleting repos or project directories in “yolo” mode).
- Some users report cancelling subscriptions after repeated overfitting or “target fixation” (e.g., the model forcing the wrong CRDT algorithm despite explicit instructions).
- Codex Cloud’s inability to truly delete tasks/diffs (only “archive”) is viewed as a privacy/security concern; local/CLI sessions are distinguished from cloud storage.
Pricing, quotas, and business context
- Users note GPT‑5.2‑Codex is substantially more expensive than the previous Codex, but subscriptions hide much of that and feel generous compared to some competitors.
- Debate over whether inference is currently profitable vs being subsidized for growth; some cite massive long‑term compute commitments and question sustainability.
- Several commenters consciously pick models per price tier: e.g., Opus/Claude Code for primary work, Codex for specialized review, or vice versa.
Shifting attitudes and skepticism
- Many long‑time skeptics say they changed their minds as models improved and now find it hard to justify not using coding agents.
- Others remain strongly skeptical, citing repeated failures on non‑toy tasks and warning about overestimating productivity gains due to psychological bias.
- There are accusations of “astroturf” enthusiasm around each LLM release, countered by reminders that some developers simply see large, real productivity improvements.