OpenAI o3 and o4-mini

Model naming, versions, and user confusion

  • Many find OpenAI’s model lineup (o1, o3, o4‑mini, 4o, 4.1, plus the mini/nano variants) bewildering, likening it to razor‑blade or toothpaste product lines that multiply confusingly similar SKUs.
  • Non‑power users say it’s exhausting to work out which model to use; some say they now write off OpenAI altogether for this reason.
  • Others argue the UI already picks reasonable defaults and that an eventual “easy mode” or a router that auto‑selects models is the right answer (a rough sketch of the router idea follows this list).
  • OpenAI staff in the thread acknowledge the naming mess, say o4‑mini replaces o3‑mini in ChatGPT, and describe a deprecation policy aimed at not breaking existing API apps.
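
  As a rough illustration of the router idea floated in the thread, here is a hypothetical sketch; the heuristics, thresholds, and default choices are invented for this example and do not reflect how ChatGPT actually selects models.

      # Hypothetical "auto-router": pick a model from crude task signals.
      # The rules below are illustrative only, not OpenAI's actual routing.
      def pick_model(prompt: str, needs_vision: bool = False) -> str:
          wants_reasoning = any(
              kw in prompt.lower()
              for kw in ("prove", "step by step", "debug", "refactor")
          )
          if needs_vision:
              return "gpt-4o"        # multimodal input
          if wants_reasoning and len(prompt) > 2000:
              return "o3"            # long, hard reasoning tasks
          if wants_reasoning:
              return "o4-mini"       # cheaper reasoning model
          return "gpt-4.1-mini"      # everyday default

      print(pick_model("Refactor this function and explain each step"))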

Comparisons to Gemini, Claude and benchmarks

  • A large subthread contrasts o3/o4‑mini with Gemini 2.5 Pro and Claude 3.7 Sonnet, especially for coding.
  • Aider and SWE‑bench scores are cited on both sides; some note that OpenAI’s internally reported numbers and the public leaderboards don’t always match, prompting trust concerns.
  • Several heavy users still prefer Gemini 2.5 Pro or Claude 3.7 for day‑to‑day coding (better long‑context handling, fewer gratuitous refactors, closer adherence to instructions), while others say o3/o4‑mini are now state‑of‑the‑art on coding benchmarks.
  • Multiple commenters think benchmarks are increasingly overfit and not very indicative of real‑world performance.

Progress, hype, and AGI

  • One camp feels the release cadence is historically fast and that o3 is a real step up, especially in reasoning combined with tool use, visual editing, and coding.
  • Another camp sees only incremental gains, lots of model churn, and “diminishing returns” relative to GPT‑4; some call the last year disappointing compared with the AGI hype.
  • AGI definitions are debated: some say we keep moving the goalposts; others point to failures on logic puzzles, chess, and niche technical questions as evidence we’re still far from anything like general reasoning.

Developer tools, pricing, and integration

  • Codex CLI is viewed as an open‑source answer to Claude Code and Aider: essentially a terminal frontend over OpenAI’s APIs, aimed at winning long‑term share in developer tooling (a minimal sketch of that idea follows this list). Early reports are mixed: impressive on some tasks, weak on others.
  • Pricing: o3 is cheaper per token than o1 but still far more expensive than Gemini for similar or only slightly better performance; some Pro subscribers complain that the $200/month tier feels unjustified.
  • There’s frustration around rollout timing (“Try it now” announcements before the models actually appear), access gated to higher subscription tiers, and knowledge cutoffs still stuck in 2023 for key models.
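
  To make “a terminal frontend over OpenAI’s APIs” concrete, here is a minimal, hypothetical Python sketch using the official openai package; it is not Codex CLI (which adds an agent loop, file edits, and command execution), only the bare request/response shape, and the model name is a placeholder.

      # Bare-bones chat loop against the OpenAI API (not Codex CLI itself).
      # Requires `pip install openai` and OPENAI_API_KEY in the environment.
      from openai import OpenAI

      client = OpenAI()   # picks up OPENAI_API_KEY automatically
      history = []        # running conversation transcript

      while True:
          try:
              prompt = input("you> ")
          except EOFError:
              break
          history.append({"role": "user", "content": prompt})
          reply = client.chat.completions.create(
              model="o4-mini",   # placeholder; any chat-capable model id
              messages=history,
          )
          answer = reply.choices[0].message.content
          history.append({"role": "assistant", "content": answer})
          print(answer)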

Reliability, hallucinations, and UX

  • Multiple concrete tests (astronomy dates, niche game reverse‑engineering, Linux/dracut, math research) show confident but wrong answers; some note o3 “knows” in its chain‑of‑thought that it’s guessing yet still answers decisively.
  • Others praise improvements: better philosophy discussions, stronger math/stats explanations, much better image editing and logo generation, and more concise code.
  • Consensus: models are powerful assistants but still untrustworthy on precise facts, niche domains, and complex tool‑driven workflows; users want clearer “I don’t know” behavior and less opaque benchmark marketing.