O3 Turns Pro

Perceived strengths and weaknesses of o3 / o3‑pro

  • Widely seen as offering higher‑quality, “system 2” / deliberate reasoning compared to fast models; good at holding state across fragmented prompts and large, messy contexts.
  • Major downside is latency (often 10–20 minutes); several people say this makes it unusable for iterative coding, but acceptable for deep analysis.
  • Some commenters find it worse or “really bad,” preferring to avoid it entirely; others rank it at the top of their personal trust hierarchy for avoiding confabulations.
  • Compared against peers:
    • Often considered better than Claude Opus at catching subtle issues, but more prone to false positives.
    • Gemini sometimes seen as more contextually reliable and with fewer hallucinated bugs.
    • 4o is preferred for “people-parsing” and fast, nuanced text tasks.

Use cases where slow reasoning is worth it

  • Long‑form research reports and strategic analysis (e.g., startup landscapes, PE targeting, life decisions).
  • Large‑scale code review across many files/services, where thoroughness beats speed.
  • Complex JSON/data extraction with high correctness requirements (see the structured‑output sketch after this list).
  • Legal workflows: parsing who’s who in complex email threads; some report that o1‑preview or 4o still does better here.
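
  One way to push correctness up on the JSON‑extraction use case is to constrain the model’s output to a schema. The thread doesn’t prescribe a mechanism, so this is only a plausible sketch using OpenAI’s structured‑output option in the Python SDK; the model name and the invoice schema are illustrative assumptions, not anything a commenter specified.

    import json
    from openai import OpenAI

    client = OpenAI()  # needs OPENAI_API_KEY in the environment

    # Hypothetical schema for pulling invoice fields out of messy text.
    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
        "additionalProperties": False,
    }

    def extract_invoice(text: str) -> dict:
        # strict json_schema mode forces the reply to parse against the schema,
        # which is the point when correctness requirements are high.
        resp = client.chat.completions.create(
            model="o3",  # illustrative; any model with structured-output support
            messages=[
                {"role": "system", "content": "Extract the invoice fields from the text."},
                {"role": "user", "content": text},
            ],
            response_format={
                "type": "json_schema",
                "json_schema": {"name": "invoice", "strict": True, "schema": INVOICE_SCHEMA},
            },
        )
        return json.loads(resp.choices[0].message.content)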

Reliability, hallucinations, and “deep research”

  • Deep research / web‑grounded tools are praised for detailed, up‑to‑date reports and source aggregation, but repeatedly described as “starting points,” not authoritative answers.
  • Users note subtle errors, missed key actors in domains they know well, and a tendency to favor SEO winners; cross‑checking sources is considered mandatory.
  • Skeptics argue that if you must verify everything anyway, direct search is simpler and more transparent.

Coding and multi‑model workflows

  • Common pattern: send the same problem to multiple models (e.g., 4o → Gemini → o3‑pro) and synthesize their answers, sometimes even having one model evaluate the others (see the sketch after this list).
  • Some use o3‑pro in tools like Cursor specifically because it respects project structure and types better than faster models.
  • Others find o3 flaky at basic edits/commands and favor Claude Code for autonomous implementation from ticket lists, with o3 as an independent auditor.
  • Adjusting “reasoning effort” is reported to significantly change quality on reasoning models.
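
  A minimal sketch of the fan‑out‑and‑synthesize pattern above, kept to the OpenAI Python SDK for brevity even though commenters actually mix providers (4o, Gemini, o3‑pro). The model names, the reasoning_effort values, and the judge prompt are illustrative assumptions rather than a prescribed workflow; the sketch also shows the “reasoning effort” knob from the last bullet.

    from openai import OpenAI

    client = OpenAI()  # needs OPENAI_API_KEY in the environment

    def ask(model: str, prompt: str, effort: str | None = None) -> str:
        # reasoning_effort is only accepted by reasoning models (e.g. o3);
        # leave it off for conventional chat models such as gpt-4o.
        extra = {"reasoning_effort": effort} if effort else {}
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **extra,
        )
        return resp.choices[0].message.content

    def fan_out_and_judge(question: str) -> str:
        # Step 1: put the same question to several models (fast drafts first).
        drafts = {
            "gpt-4o": ask("gpt-4o", question),
            "o3": ask("o3", question, effort="high"),
        }
        # Step 2: have one model act as judge/synthesizer over the drafts.
        candidates = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())
        judge_prompt = (
            "Synthesize the single best answer to the question below, "
            "noting where the candidate answers disagree.\n\n"
            f"Question:\n{question}\n\nCandidate answers:\n{candidates}"
        )
        return ask("o3", judge_prompt, effort="high")

  Swapping Gemini or Claude into the draft step only changes the ask() helper; the fan‑out/judge shape stays the same.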

Latency, UX, and modality

  • Several say chat UX clashes with 10–20 minute replies; an email‑like, long‑form correspondence interaction model is proposed as a better fit for o3‑pro.
  • Some note that o3‑pro has recently become somewhat faster, but still far from interactive.

Environmental and ethical considerations

  • A subset worries about energy use when hitting multiple models per query; others treat it like streaming in 4K or air travel—acknowledged but not behavior‑changing.
  • Debate on whether users should factor climate impact into everyday usage, versus pushing responsibility onto model providers.

Broader AI/AGI and profession impact

  • Skepticism that current systems qualify as “AGI,” pointing out that visible impact so far falls mostly on developers and content creators, not yet on roles like sales, accounting, law, or teaching.
  • One commenter frames adoption decisions as cost/benefit rather than “trust,” using LLMs for work that isn’t worth a human’s time to perfect.