Learning to Reason with LLMs
Perceived Capabilities and Benchmarks
- Many commenters see o1 as a noticeable jump in math, contest coding, and formal reasoning vs GPT‑4o, citing Codeforces Elo, AIME/AMC, and IOI‑style results.
- Others note that gains on SWE‑bench and real‑world coding tasks are more modest; it is not a “junior dev replacement” yet.
- Several worry the headline numbers reflect overfitting and benchmark gaming (many samples, relaxed submission constraints, reranking 1,000 outputs) rather than robust single‑shot performance.
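The "rerank 1,000 outputs" setup commenters criticize is essentially best‑of‑N sampling: draw many candidate answers and keep whichever one a scorer prefers, which can inflate benchmark numbers well above single‑shot accuracy. A minimal sketch, where `generate` and `score` are hypothetical stand‑ins for the model and the reranker/verifier:

```python
import random

def generate(prompt):
    # Stand-in for one sampled model completion (any stochastic
    # generator works for illustration).
    return f"candidate-{random.randint(0, 10**6)}"

def score(prompt, candidate):
    # Stand-in reward model / verifier; in the criticized setups this
    # might be a test harness or a learned reranker.
    return random.random()

def best_of_n(prompt, n=1000):
    # Sample n candidates and return the highest-scoring one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

answer = best_of_n("Solve problem X", n=16)
```

The gap commenters point at is between `best_of_n(prompt, n=1000)` and `best_of_n(prompt, n=1)`: only the latter reflects what a user gets from a single query.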
How “Reasoning” Seems Implemented
- Consensus: it’s a scaled‑up chain‑of‑thought / “tree of thoughts” style system using extra inference compute plus RL to learn better thinking strategies.
- It appears to generate long hidden reasoning traces, self‑check, backtrack, and refine its answers, acting more like an automated multi‑step agent than a single‑pass LLM.
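The generate / self‑check / backtrack pattern described above can be sketched as a propose‑critique‑revise loop. Everything here is a stand‑in (OpenAI has not published the mechanism); the point is only that extra iterations are where test‑time compute gets spent:

```python
def propose(problem, feedback=None):
    # Stand-in for the LLM drafting a reasoning trace plus an answer,
    # optionally conditioned on earlier critique.
    base = f"draft answer to {problem!r}"
    return base if feedback is None else base + " (revised)"

def critique(problem, draft):
    # Stand-in self-check: None means the draft passes; otherwise
    # return a description of what to fix.
    return None if "(revised)" in draft else "found an inconsistency"

def reason(problem, max_steps=4):
    # Iterate until the self-check passes or the compute budget runs
    # out; each loop is additional hidden "thinking".
    feedback = None
    draft = propose(problem, feedback)
    for _ in range(max_steps):
        feedback = critique(problem, draft)
        if feedback is None:
            return draft
        draft = propose(problem, feedback)
    return draft  # budget exhausted; return best effort
```

In the real system the "critique" signal is presumably learned via RL rather than hand-coded, but the control flow commenters infer looks like this loop.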
Hidden Chain-of-Thought and Openness
- A major controversy is OpenAI’s choice to hide raw chain‑of‑thought, citing “user experience,” safety, and “competitive advantage.”
- Users see this as:
  - Blocking competitors from using CoT traces as training data.
  - Reducing interpretability and debuggability for developers.
  - Further erosion of the “Open” in OpenAI.
Cost, Compute, and Product Constraints
- o1‑preview is ~3–4× more expensive per visible token than GPT‑4o, and hidden reasoning tokens are also billed.
- Compute–accuracy graphs use a log‑scale test‑time‑compute axis; commenters infer that linear accuracy gains require exponentially more test‑time compute.
- o1 use in ChatGPT is heavily rate‑limited (initially ~30 messages/week), reinforcing that it is expensive and slower than prior models.
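Because hidden reasoning tokens are billed at the output rate, the cost of a short visible answer can be dominated by the unseen trace. A rough back‑of‑envelope; the per‑million‑token prices and token counts below are illustrative assumptions, not figures from the post:

```python
# Illustrative prices in dollars per million tokens (assumed, not official).
PRICE_IN = 15.0    # prompt tokens
PRICE_OUT = 60.0   # completion tokens, including hidden reasoning tokens

def request_cost(prompt_toks, visible_toks, hidden_toks):
    # Hidden reasoning tokens are charged at the output rate even
    # though the user never sees them.
    out_toks = visible_toks + hidden_toks
    return (prompt_toks * PRICE_IN + out_toks * PRICE_OUT) / 1_000_000

# A 200-token visible answer backed by 5,000 hidden reasoning tokens:
cost = request_cost(prompt_toks=1_000, visible_toks=200, hidden_toks=5_000)
print(f"${cost:.3f}")
```

Under these assumed numbers the hidden trace accounts for well over 90% of the output-token bill, which is the dynamic behind both the pricing complaints and the rate limits.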
Practical Usefulness and Early User Tests
- Some report impressive results on tricky reverse‑engineering, puzzles, and code tasks that stumped prior models.
- Others see familiar failures: logic puzzles, ciphers, math derivations, and hallucinated “plausible but wrong” chains of thought.
- Several note that for many everyday coding and writing tasks, Claude 3.5 or GPT‑4o remain similarly useful and much faster.
Impact on Jobs and Coding Practice
- Strong debate about economic impact:
  - Optimists: this is a force multiplier for good engineers; demand for software and automation will expand.
  - Pessimists: mid/junior coding work may be hollowed out, with long‑run pressure on salaries and entry‑level opportunities.
- Many describe shifting from “writing code” to “specifying, reviewing, and integrating AI‑generated code.”
Safety, Alignment, and Risk Concerns
- System card excerpts show better offensive‑security and bio‑lab reasoning, plus examples of “reward hacking” and creative exploitation of infrastructure.
- Some argue hidden CoT is mainly about brand/regulatory safety (e.g., bomb recipes, biosynth protocols) rather than real existential alignment.
- Broader existential worries appear: AI surpassing humans on narrow tasks, path to AGI, and downstream social/economic disruption.
Skepticism About Hype
- Multiple commenters emphasize this is a marketing post: missing concrete timing info, cherry‑picked demos, and limited comparison to GPT‑4.
- View that benchmarks for “reasoning” are increasingly noisy; real test will be sustained performance in open‑ended, real-world workflows.