O3 Turns Pro

Perceived strengths and weaknesses of o3 / o3‑pro

  • Widely seen as offering higher‑quality, “system 2” / deliberate reasoning compared to fast models; good at holding state across fragmented prompts and large, messy contexts.
  • Major downside is latency (often 10–20 minutes); several people say this makes it unusable for iterative coding, but acceptable for deep analysis.
  • Some commenters find it worse or “really bad,” preferring to avoid it entirely; others rank it at the top of their personal trust hierarchy for avoiding confabulations.
  • Compared against peers:
    • Often considered better than Claude Opus at catching subtle issues, but more prone to false positives.
    • Gemini sometimes seen as more contextually reliable and with fewer hallucinated bugs.
    • 4o is preferred for “people-parsing” and fast, nuanced text tasks.

Use cases where slow reasoning is worth it

  • Long‑form research reports and strategic analysis (e.g., startup landscapes, PE targeting, life decisions).
  • Large‑scale code review across many files/services, where thoroughness beats speed.
  • Complex JSON/data extraction with high correctness requirements (see the structured‑output sketch after this list).
  • Legal workflows: parsing who’s who in complex email threads; some report that o1‑preview or 4o still does better here.
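
  One way to push correctness up on the JSON‑extraction use case is to constrain the model’s output to a schema. The thread doesn’t prescribe a mechanism, so this is only a plausible sketch using OpenAI’s structured‑output option in the Python SDK; the model name and the invoice schema are illustrative assumptions, not anything a commenter specified.

    import json
    from openai import OpenAI

    client = OpenAI()  # needs OPENAI_API_KEY in the environment

    # Hypothetical schema for pulling invoice fields out of messy text.
    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
        "additionalProperties": False,
    }

    def extract_invoice(text: str) -> dict:
        # strict json_schema mode forces the reply to parse against the schema,
        # which is the point when correctness requirements are high.
        resp = client.chat.completions.create(
            model="o3",  # illustrative; any model with structured-output support
            messages=[
                {"role": "system", "content": "Extract the invoice fields from the text."},
                {"role": "user", "content": text},
            ],
            response_format={
                "type": "json_schema",
                "json_schema": {"name": "invoice", "strict": True, "schema": INVOICE_SCHEMA},
            },
        )
        return json.loads(resp.choices[0].message.content)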

Reliability, hallucinations, and “deep research”

  • Deep research / web‑grounded tools are praised for detailed, up‑to‑date reports and source aggregation, but repeatedly described as “starting points,” not authoritative answers.
  • Users note subtle errors, missed key actors in domains they know well, and a tendency to favor SEO winners; cross‑checking sources is considered mandatory.
  • Skeptics argue that if you must verify everything anyway, direct search is simpler and more transparent.

Coding and multi‑model workflows

  • Common pattern: send the same problem to multiple models (e.g., 4o → Gemini → o3‑pro) and synthesize their answers, sometimes even having one model evaluate the others (see the sketch after this list).
  • Some use o3‑pro in tools like Cursor specifically because it respects project structure and types better than faster models.
  • Others find o3 flaky at basic edits/commands and favor Claude Code for autonomous implementation from ticket lists, with o3 as an independent auditor.
  • Adjusting “reasoning effort” is reported to significantly change quality on reasoning models.
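
  A minimal sketch of the fan‑out‑and‑synthesize pattern above, kept to the OpenAI Python SDK for brevity even though commenters actually mix providers (4o, Gemini, o3‑pro). The model names, the reasoning_effort values, and the judge prompt are illustrative assumptions rather than a prescribed workflow; the sketch also shows the “reasoning effort” knob from the last bullet.

    from openai import OpenAI

    client = OpenAI()  # needs OPENAI_API_KEY in the environment

    def ask(model: str, prompt: str, effort: str | None = None) -> str:
        # reasoning_effort is only accepted by reasoning models (e.g. o3);
        # leave it off for conventional chat models such as gpt-4o.
        extra = {"reasoning_effort": effort} if effort else {}
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **extra,
        )
        return resp.choices[0].message.content

    def fan_out_and_judge(question: str) -> str:
        # Step 1: put the same question to several models (fast drafts first).
        drafts = {
            "gpt-4o": ask("gpt-4o", question),
            "o3": ask("o3", question, effort="high"),
        }
        # Step 2: have one model act as judge/synthesizer over the drafts.
        candidates = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())
        judge_prompt = (
            "Synthesize the single best answer to the question below, "
            "noting where the candidate answers disagree.\n\n"
            f"Question:\n{question}\n\nCandidate answers:\n{candidates}"
        )
        return ask("o3", judge_prompt, effort="high")

  Swapping Gemini or Claude into the draft step only changes the ask() helper; the fan‑out/judge shape stays the same.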

Latency, UX, and modality

  • Several say chat UX clashes with 10–20 minute replies; an email‑like, long‑form correspondence interaction model is proposed as a better fit for o3‑pro.
  • Some note that o3‑pro has recently become somewhat faster, but still far from interactive.

Environmental and ethical considerations

  • A subset worries about energy use when hitting multiple models per query; others treat it like streaming in 4K or air travel—acknowledged but not behavior‑changing.
  • Debate on whether users should factor climate impact into everyday usage, versus pushing responsibility onto model providers.

Broader AI/AGI and profession impact

  • Skepticism that current systems qualify as “AGI,” pointing out that visible impact so far falls mostly on developers and content creators, not yet on roles like sales, accounting, law, or teaching.
  • One commenter frames adoption decisions as cost/benefit rather than “trust,” using LLMs for work that isn’t worth a human’s time to perfect.