Terence Tao on o1
Perceived Capabilities of o1
- Many see o1 as a clear step up from previous models, especially on structured reasoning tasks (math olympiad–style, programming contests, puzzles like NYT Connections).
- Several report it behaving like a “mediocre but not incompetent grad student”: can follow nontrivial reasoning, propose strategies, and be prodded to better answers.
- Some users report strong results in specialized areas:
  - Optimizing already high‑performance Rust code.
  - Suggesting useful MIP (mixed‑integer programming) formulations and constraints in operations research.
  - Helping clarify geometric/CAD questions (e.g., Bézier curve continuity).
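The Bézier‑continuity question mentioned above is the kind of thing that is easy to state precisely. A minimal sketch, with made‑up control points and helper names: two cubic segments are C0‑continuous when they share the junction point, and C1‑continuous when their boundary tangent vectors also match.

```python
# Illustrative check of C0/C1 continuity where two cubic Bezier segments
# meet. Control-point lists P, Q and the helper names are assumptions.

def is_c0(P, Q):
    """C0 (positional) continuity: last point of P equals first point of Q."""
    return P[3] == Q[0]

def is_c1(P, Q):
    """C1 continuity: the end tangent of P equals the start tangent of Q.
    For cubics the tangent at the boundary is 3 * (difference of the two
    boundary control points), so the difference vectors must match."""
    dP = (P[3][0] - P[2][0], P[3][1] - P[2][1])
    dQ = (Q[1][0] - Q[0][0], Q[1][1] - Q[0][1])
    return is_c0(P, Q) and dP == dQ

# Two segments sharing an endpoint, with equal tangent vectors at the join:
P = [(0, 0), (1, 0), (2, 1), (3, 1)]
Q = [(3, 1), (4, 1), (5, 0), (6, 0)]
print(is_c0(P, Q), is_c1(P, Q))  # → True True
```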
- Others find it similar to or worse than GPT‑4o or Claude 3.5 Sonnet, while being slower and more verbose.
Failures, Hallucinations, and Limits
- Multiple examples where o1 confidently gives wrong math: basic inequalities, geometry, Euclid’s first postulate, network flow reductions, path‑via‑vertex algorithms.
- For MIP and more complex OR tasks, it often produces plausible but incorrect formulations; experts emphasize that every constraint still needs careful checking.
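The "check every constraint" advice above can be made concrete: on a tiny instance, a suggested formulation can be brute‑forced against direct enumeration before it is trusted at scale. The knapsack instance and names below are illustrative, not taken from the thread.

```python
# Sanity-check a tiny 0/1 formulation by enumerating all binary vectors.
# On a toy instance, the optimum from enumeration is ground truth; a
# model-suggested constraint that changes this value is wrong.
from itertools import product

values  = [6, 10, 12]
weights = [1, 2, 3]
cap = 5

def feasible(x):
    # The constraint under test; with a suggested formulation, each
    # constraint would be checked here, one at a time.
    return sum(w * xi for w, xi in zip(weights, x)) <= cap

best = max(
    sum(v * xi for v, xi in zip(values, x))
    for x in product((0, 1), repeat=3)
    if feasible(x)
)
print(best)  # → 22 (optimal value over all feasible 0/1 vectors)
```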
- Users note it can reproduce known strategies from existing work but rarely offers truly new “creative” ideas on open research problems.
- Some report it timing out, or “thinking” for tens of seconds before returning no answer.
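The path‑via‑vertex failure above has a well‑known subtlety. The standard reduction computes dist(s, v) + dist(v, t), but the concatenated route may revisit vertices, so it is a shortest *walk* through v, not necessarily a simple path; this is exactly where plausible‑sounding answers go wrong. A minimal BFS sketch on a made‑up unweighted graph:

```python
# Shortest s→t walk through a required vertex v on an unweighted graph,
# via two BFS passes. Graph and vertex labels are illustrative.
from collections import deque

def bfs_dist(adj, src):
    """Return BFS distances from src to every reachable vertex."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

adj = {
    "s": ["a"], "a": ["s", "v"], "v": ["a", "t"], "t": ["v"],
}
# dist(s, v) + dist(v, t): correct for walks, but the two legs may share
# vertices, so extra work is needed if a simple path is required.
via = bfs_dist(adj, "s")["v"] + bfs_dist(adj, "v")["t"]
print(via)  # → 3
```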
Usefulness in Practice
- Many programmers use LLMs as “junior devs” or “interns” for: boilerplate, API glue, unit tests, scripts, type annotations, refactors, documentation, and unfamiliar ecosystems.
- Non‑CS users report going from zero to production apps or automation scripts by steering the model and fixing small issues.
- Others find LLM‑written code stylistically poor, hard to maintain, and often slower to repair than to write from scratch, especially for niche algorithms, research code, or tricky concurrency.
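The "junior dev" workload described above is typically mechanical: small typed functions plus their unit tests. A hypothetical example of the kind of task these users delegate (function, annotations, and tests are all made up):

```python
# Boilerplate-style task: a small annotated utility plus its tests,
# written pytest-style with plain asserts.

def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())

def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Many   spaces ") == "many-spaces"

test_slugify()
print("ok")  # → ok
```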
Prompting, Tools, and Workflows
- Effective use requires iterative prompting, breaking problems into small steps, and treating the model as a collaborator rather than a one‑shot oracle.
- Tools like “agentic” editors (aider, Cursor, similar) wrap base models in loops that plan changes, edit code, run tests/linters, and retry.
- Several draw parallels between chain‑of‑thought in LLMs and humans explicitly writing out definitions and steps when doing math.
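The agentic loop those editors run can be sketched in a few lines: propose a patch, run the tests, and retry on failure. Here the "model" is just a stub cycling through candidate patches; everything (names, patches, the one‑assertion test suite) is illustrative, not how aider or Cursor are actually implemented.

```python
# Minimal propose → test → retry loop with a stubbed "model".

def run_tests(code):
    """Apply a candidate patch (exec into a scratch namespace) and run
    the test suite, here reduced to a single check."""
    env = {}
    exec(code, env)
    try:
        assert env["double"](3) == 6
        return True
    except (AssertionError, KeyError):
        return False

candidates = [
    "def double(x): return x + 2",   # plausible but wrong patch
    "def double(x): return x * 2",   # correct patch
]

for attempt, patch in enumerate(candidates, start=1):
    if run_tests(patch):
        print(f"tests passed on attempt {attempt}")  # → tests passed on attempt 2
        break
```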
Careers, Skills, and Attitudes
- Some fear erosion of programming skills, degraded code quality, and downward pressure on salaries; others argue businesses care about delivered value, not hand‑written code.
- There is debate over whether LLMs will mostly replace weak programmers, amplify strong ones, or eventually threaten even top‑tier experts.
- The thread contains both strong enthusiasm (“massive productivity boost”) and deep skepticism (likening LLM hype to crypto/NFTs, citing persistent hallucinations and diminishing returns).
Proof Assistants and Verification
- Several expect much higher value once models are better tuned on formal systems like Lean, where proofs can be mechanically verified.
- Current Lean libraries cover only a small fraction of research math, and auto‑formalization from natural language remains unreliable.
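The appeal of mechanical verification is that acceptance is binary: if a Lean file elaborates, the proof is correct, leaving no room for a confident hallucination. A toy example using only Lean 4 core (no Mathlib):

```lean
-- If this file type-checks, the proof is verified; a wrong proof is
-- rejected outright rather than sounding plausible.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```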