O3 Turns Pro
Perceived strengths and weaknesses of o3 / o3‑pro
- Widely seen as higher‑quality, “system 2” / deliberate reasoning compared to fast models; good at holding state across fragmented prompts and large, messy contexts.
- Major downside is latency (often 10–20 minutes); several people say this makes it unusable for iterative coding, but acceptable for deep analysis.
- Some commenters find it worse or “really bad,” preferring to avoid it entirely; others rank it at the top of their personal trust hierarchy for producing the fewest confabulations.
- Compared against peers:
- Often considered better than Claude Opus at catching subtle issues, but more prone to false positives.
- Gemini sometimes seen as more contextually reliable and with fewer hallucinated bugs.
- 4o is preferred for “people-parsing” and fast, nuanced text tasks.
Use cases where slow reasoning is worth it
- Long‑form research reports and strategic analysis (e.g., startup landscapes, PE targeting, life decisions).
- Large‑scale code review across many files/services, where thoroughness beats speed.
- Complex JSON/data extraction with high correctness requirements.
- Legal workflows: parsing who’s who in complex email threads; some report that o1‑preview or 4o still does better here.
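For the high‑correctness JSON extraction use case, one common pattern (not described in the thread, and sketched here with the model call abstracted as a plain callable) is to parse the model’s reply and re‑prompt with the parse error until the output is valid:

```python
import json


def extract_json(prompt: str, model, max_retries: int = 2) -> dict:
    """Ask a model for JSON and re-prompt on parse failure.

    `model` is a hypothetical callable (prompt -> str); a real
    implementation would call a provider SDK here instead.
    """
    request = prompt + "\n\nRespond with a single JSON object only."
    for _ in range(max_retries + 1):
        raw = model(request)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the model can self-correct.
            request = (
                f"{prompt}\n\nYour last reply was not valid JSON "
                f"({err}). Reply with a single JSON object only."
            )
            continue
        if isinstance(data, dict):
            return data
    raise ValueError("model never produced a valid JSON object")
```

The retry loop trades latency for correctness, which matches the thread’s framing: acceptable for slow, thorough work, not for interactive use.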
Reliability, hallucinations, and “deep research”
- Deep research / web‑grounded tools are praised for detailed, up‑to‑date reports and source aggregation, but repeatedly described as “starting points,” not authoritative answers.
- Users note subtle errors, missing key actors in domains they know well, and over‑reliance on sources that win at SEO; cross‑checking sources is considered mandatory.
- Skeptics argue that if you must verify everything anyway, direct search is simpler and more transparent.
Coding and multi‑model workflows
- Common pattern: send the same problem to multiple models (e.g., 4o → Gemini → o3‑pro) and synthesize their answers, sometimes even having one model evaluate the others.
- Some use o3‑pro in tools like Cursor specifically because it respects project structure and types better than faster models.
- Others find o3 flaky at basic edits/commands and favor Claude Code for autonomous implementation from ticket lists, with o3 as an independent auditor.
- Adjusting “reasoning effort” is reported to significantly change quality on reasoning models.
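The fan‑out‑and‑synthesize pattern above can be sketched as follows. Model calls are abstracted as plain callables (hypothetical; a real workflow would invoke each provider’s SDK, e.g. 4o, Gemini, and o3‑pro behind each callable):

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt: str, models: dict) -> dict:
    """Send the same prompt to several model callables in parallel.

    `models` maps a model name to a callable (prompt -> str).
    Returns {name: answer} in the same order the models were given.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        return {name: f.result() for name, f in futures.items()}


def synthesize(prompt: str, answers: dict, judge) -> str:
    """Have one model (the 'judge') evaluate the others' answers."""
    bundle = "\n\n".join(f"[{name}]\n{ans}" for name, ans in answers.items())
    return judge(
        f"Question:\n{prompt}\n\n"
        f"Candidate answers:\n{bundle}\n\n"
        "Compare the candidates and synthesize the best single answer."
    )
```

Running the judge as just another callable makes it easy to rotate which model audits the others, matching the thread’s suggestion of using o3 as an independent auditor over Claude Code’s output.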
Latency, UX, and modality
- Several say chat UX clashes with 10–20 minute replies; an email‑like, long‑form correspondence model is proposed as a better fit for o3‑pro.
- Some note that o3‑pro has recently become somewhat faster, but still far from interactive.
Environmental and ethical considerations
- A subset worries about energy use when hitting multiple models per query; others treat it like streaming in 4K or air travel—acknowledged but not behavior‑changing.
- Debate on whether users should factor climate impact into everyday usage, versus pushing responsibility onto model providers.
Broader AI/AGI and profession impact
- Skepticism that current systems qualify as “AGI,” pointing out that visible impact falls mostly on developers and content creators, not yet on roles like sales, accounting, law, or teaching.
- One commenter frames adoption decisions as cost/benefit rather than “trust,” using LLMs for work that isn’t worth a human’s time to perfect.