GPT-5.2

Model identity, training, and scaling

  • Many commenters doubt GPT‑5.2 is a genuinely new base model, suspecting continued pretraining on GPT‑4/4o weights plus more aggressive reasoning/RL post‑training rather than a fresh from‑scratch run.
  • The new August 2025 knowledge cutoff is seen as evidence of either incremental pretraining or a late, rushed run triggered by Google’s Gemini 3 “code red.”
  • Discussion of a broader slowdown in pure scaling since GPT‑4: most frontier models now improve mainly through reasoning, RL, and training‑data quality rather than large parameter jumps. Hardware and systems constraints (GPU memory, MoE routing, interconnect bandwidth) and datacenter buildout limits are a recurring theme.

Benchmarks, ARC‑AGI, and accusations of gaming

  • The big ARC‑AGI v2 jump (into the low‑50% range) is widely noted; some call it “insane” and encouraging for generalization, while others see it as a sign the benchmark is being explicitly trained on.
  • Debate over ARC‑AGI itself: some treat it like a robust IQ‑style test for reasoning; others argue it’s overfittable, vision‑heavy, or analogous to being good at contest math rather than “intelligence.”
  • OpenAI’s homegrown GDPval benchmark draws skepticism as an in‑house metric. There’s concern about cherry‑picked cross‑lab comparisons (e.g., omitting SWE‑Bench cases where rivals win).
  • Growing sentiment that benchmark saturation makes headline numbers less meaningful than long‑horizon, real‑world task performance.

Pricing, Pro tier, and economics

  • API prices for 5.2 are ~40% higher than 5.1’s; many question calling this “slight.” Some note it’s still cheaper than top Anthropic/Google tiers; others see this as the start of enshittification.
  • GPT‑5.2 Pro reasoning is viewed as “priced not to be used” except by highly price‑insensitive customers or for marketing benchmarks; there are reports of single prompts costing double‑digit dollars.
  • A few point out that reasoning on difficult benchmarks (e.g., ARC‑AGI) is dramatically cheaper than it was with earlier o3‑style models, so “intelligence per dollar” has still improved (a worked sketch of that metric follows this list).
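
To make the “intelligence per dollar” framing concrete, here is a minimal sketch of the metric as a score‑per‑cost ratio. All figures are hypothetical placeholders, not actual ARC‑AGI results or OpenAI pricing; only the shape of the arithmetic is the point.

```python
# Hypothetical illustration of "intelligence per dollar" on a benchmark.
# All figures below are made-up placeholders, NOT real ARC-AGI or OpenAI pricing.

def score_per_dollar(score_pct: float, cost_per_task_usd: float) -> float:
    """Benchmark score (%) divided by the average cost to attempt one task."""
    return score_pct / cost_per_task_usd

# Suppose an older o3-style run scored 30% at $5.00/task, while a newer
# model scores 50% at $1.00/task (placeholder numbers).
old = score_per_dollar(30.0, 5.00)   # 6.0 points per dollar
new = score_per_dollar(50.0, 1.00)   # 50.0 points per dollar

print(f"old: {old:.1f} pts/$  new: {new:.1f} pts/$  gain: {new / old:.1f}x")
```

Under these assumed numbers the newer model is both more capable and cheaper per task, so the ratio improves even when the absolute score jump looks modest.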

Capabilities and UX: coding, vision, and spreadsheets

  • Coding: mixed experiences. Some find Codex + 5.x Thinking excellent for complex debugging and refactors; others still prefer Claude Code or Gemini 3 for reliability and speed, especially for UI work.
  • Vision remains notably subhuman. OpenAI’s own motherboard demo is criticized for mislabeling components; OpenAI staff acknowledge the example shows “better, not perfect” vision.
  • Spreadsheet and finance tasks (e.g., multi‑statement models, SEC parsing) are a standout positive anecdote; some see this as serious pressure on junior analyst roles.
  • Context handling: the 400k‑token API context window and the new “compaction” feature are praised, but ChatGPT web/app limits remain lower, and very long contexts still degrade quality (a generic sketch of the compaction idea follows this list).
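
OpenAI has not published how “compaction” works; the pattern the term usually evokes is summarizing or pruning the oldest turns so a conversation stays under the model’s context budget. The sketch below shows only that generic pattern, with invented helpers (count_tokens, summarize) and an arbitrary budget, not OpenAI’s implementation.

```python
# Generic sketch of conversation "compaction": when the history exceeds a
# token budget, fold the oldest turns into a summary turn. Helper functions
# and the budget are invented for illustration only.

def count_tokens(text: str) -> int:
    # Placeholder: a real system would use the model's actual tokenizer.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Placeholder: a real system would ask the model itself to summarize.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: list[str], budget: int = 400) -> list[str]:
    """Repeatedly merge the two oldest turns until the history fits the budget."""
    while sum(count_tokens(t) for t in history) > budget and len(history) > 2:
        history = [summarize(history[:2])] + history[2:]
    return history

history = [f"turn {i}: " + "word " * 60 for i in range(10)]
print(len(history), "->", len(compact(history)), "turns after compaction")
```

The trade-off this pattern implies matches the thread’s caveat: compaction keeps long sessions usable, but summaries lose detail, so very long contexts still degrade answer quality.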

Safety, hallucinations, and trust

  • Third‑party red‑teaming shows high refusal rates for naive harmful prompts but much weaker resistance under jailbreaks, especially around impersonation, harassment, and disinformation.
  • Many users remain frustrated by confident hallucinations in domains like electronics, physics, and niche technical details, arguing that better grounding and calibrated uncertainty matter more now than raw benchmark gains.

Competition and user migration

  • A sizable minority say they’ve switched primary usage to Gemini 3 or Claude (especially for coding and search‑heavy tasks), citing better day‑to‑day feel despite OpenAI’s benchmark claims.
  • Others still prefer ChatGPT for voice, overall polish, or reliability of deep reasoning, but agree that meaningful differentiation now lies more in UX, tools, and grounding than in another small reasoning bump.