OpenAI o3 and o4-mini

Model naming, versions, and user confusion

  • Many find OpenAI’s model lineup (o1, o3, o4‑mini, 4o, 4.1, plus the mini/nano variants) bewildering, likening it to razor‑blade or toothpaste product lines that multiply confusingly similar SKUs.
  • Non‑power users say it’s exhausting to work out which model to use; some say they now write off OpenAI altogether for this reason.
  • Others argue the UI already picks reasonable defaults and that an eventual “easy mode” or a router that auto‑selects models is the right answer (a rough sketch of the router idea follows this list).
  • OpenAI staff in the thread acknowledge the naming mess, say o4‑mini replaces o3‑mini in ChatGPT, and describe a deprecation policy aimed at not breaking existing API apps.
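
  As a rough illustration of the router idea floated in the thread, here is a hypothetical sketch; the heuristics, thresholds, and default choices are invented for this example and do not reflect how ChatGPT actually selects models.

      # Hypothetical "auto-router": pick a model from crude task signals.
      # The rules below are illustrative only, not OpenAI's actual routing.
      def pick_model(prompt: str, needs_vision: bool = False) -> str:
          wants_reasoning = any(
              kw in prompt.lower()
              for kw in ("prove", "step by step", "debug", "refactor")
          )
          if needs_vision:
              return "gpt-4o"        # multimodal input
          if wants_reasoning and len(prompt) > 2000:
              return "o3"            # long, hard reasoning tasks
          if wants_reasoning:
              return "o4-mini"       # cheaper reasoning model
          return "gpt-4.1-mini"      # everyday default

      print(pick_model("Refactor this function and explain each step"))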

Comparisons to Gemini, Claude and benchmarks

  • A large subthread contrasts o3/o4‑mini with Gemini 2.5 Pro and Claude 3.7 Sonnet, especially for coding.
  • Aider and SWE‑bench scores are cited on both sides; some note that OpenAI’s internally reported numbers and the public leaderboards don’t always match, prompting trust concerns.
  • Several heavy users still prefer Gemini 2.5 Pro or Claude 3.7 for day‑to‑day coding (better long‑context handling, fewer gratuitous refactors, closer adherence to instructions), while others say o3/o4‑mini are now state‑of‑the‑art on coding benchmarks.
  • Multiple commenters think benchmarks are increasingly overfit and not very indicative of real‑world performance.

Progress, hype, and AGI

  • One camp feels the release cadence is historically fast and that o3 is a real step up, especially in reasoning combined with tool use, visual editing, and coding.
  • Another camp sees only incremental gains, lots of model churn, and “diminishing returns” relative to GPT‑4; some call the last year disappointing compared with the AGI hype.
  • AGI definitions are debated: some say we keep moving the goalposts; others point to failures on logic puzzles, chess, and niche technical questions as evidence we’re still far from anything like general reasoning.

Developer tools, pricing, and integration

  • Codex CLI is viewed as an open‑source answer to Claude Code and Aider: essentially a terminal frontend over OpenAI’s APIs, aimed at winning long‑term share in developer tooling (a minimal sketch of that idea follows this list). Early reports are mixed: impressive on some tasks, weak on others.
  • Pricing: o3 is cheaper per token than o1 but still far more expensive than Gemini for similar or only slightly better performance; some Pro subscribers complain that the $200/month tier feels unjustified.
  • There’s frustration around rollout timing (“Try it now” announcements before the models actually appear), access gated to higher subscription tiers, and knowledge cutoffs still stuck in 2023 for key models.
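
  To make “a terminal frontend over OpenAI’s APIs” concrete, here is a minimal, hypothetical Python sketch using the official openai package; it is not Codex CLI (which adds an agent loop, file edits, and command execution), only the bare request/response shape, and the model name is a placeholder.

      # Bare-bones chat loop against the OpenAI API (not Codex CLI itself).
      # Requires `pip install openai` and OPENAI_API_KEY in the environment.
      from openai import OpenAI

      client = OpenAI()   # picks up OPENAI_API_KEY automatically
      history = []        # running conversation transcript

      while True:
          try:
              prompt = input("you> ")
          except EOFError:
              break
          history.append({"role": "user", "content": prompt})
          reply = client.chat.completions.create(
              model="o4-mini",   # placeholder; any chat-capable model id
              messages=history,
          )
          answer = reply.choices[0].message.content
          history.append({"role": "assistant", "content": answer})
          print(answer)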

Reliability, hallucinations, and UX

  • Multiple concrete tests (astronomy dates, niche game reverse‑engineering, Linux/dracut, math research) show confident but wrong answers; some note o3 “knows” in its chain‑of‑thought that it’s guessing yet still answers decisively.
  • Others praise improvements: better philosophy discussions, stronger math/stats explanations, much better image editing and logo generation, and more concise code.
  • Consensus: models are powerful assistants but still untrustworthy on precise facts, niche domains, and complex tool‑driven workflows; users want clearer “I don’t know” behavior and less opaque benchmark marketing.