Learning to Reason with LLMs

Perceived Capabilities and Benchmarks

  • Many commenters see o1 as a noticeable jump in math, contest coding, and formal reasoning vs GPT‑4o, citing Codeforces Elo, AIME/AMC, and IOI‑style results.
  • Others note gains on SWE‑bench and real coding tasks are more modest; not “junior dev replacement” yet.
  • Several worry about overfitting and benchmark gaming (many samples per problem, relaxed submission limits, reranking among 1,000 candidate outputs) rather than robust single‑shot performance.
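The reranking concern above refers to best‑of‑n style evaluation: sample many candidate answers and report the plurality (or reranked) one, which can inflate scores well beyond single‑shot accuracy. A minimal majority‑vote sketch, with a hypothetical stub standing in for the model call and a made‑up answer distribution:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    # Stand-in for one stochastic model sample; a real system would call
    # an LLM at temperature > 0. Hypothetical toy distribution: the
    # "correct" answer appears 50% of the time, distractors split the rest.
    return rng.choices(["42", "41", "43"], weights=[0.5, 0.25, 0.25])[0]

def best_of_n(problem, n, seed=0):
    """Sample n candidates and return the most common answer
    (self-consistency / majority-vote reranking)."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(problem, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

# With n=1000, the plurality answer is reliably the majority mode even
# though any single draw is wrong half the time.
print(best_of_n("toy problem", n=1000))
```

This illustrates why commenters distinguish "pass@1000 with reranking" from the single‑shot performance a developer actually gets per API call.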

How “Reasoning” Seems Implemented

  • Consensus: it’s a scaled‑up chain‑of‑thought / “tree of thoughts” style system using extra inference compute plus RL to learn better thinking strategies.
  • It appears to generate long hidden reasoning traces, self‑check, backtrack, and refine answers, somewhat like an automated multi‑step agent rather than a single‑pass LLM.
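The loop described above can be sketched as a propose/verify/revise cycle. This is a toy illustration of the control flow commenters are describing, not OpenAI's actual method; `propose` and `check` are hypothetical stand‑ins for model calls:

```python
def reason_with_revision(problem, propose, check, max_steps=8):
    """Draft a solution, self-check it, and revise using the critique
    until the check passes or the step budget runs out."""
    trace = []  # hidden chain-of-thought: kept internal, not shown to the user
    attempt = propose(problem, feedback=None)
    for _ in range(max_steps):
        ok, feedback = check(problem, attempt)
        trace.append((attempt, feedback))
        if ok:
            return attempt, trace
        attempt = propose(problem, feedback=feedback)  # backtrack and refine
    return attempt, trace  # budget exhausted; return best effort

# Toy instantiation: "solve" x + 3 == 10 by guessing, then correcting.
def propose(problem, feedback):
    return 0 if feedback is None else feedback  # naive: adopt the hinted value

def check(problem, attempt):
    target = 10 - 3
    return (attempt == target, target)

answer, trace = reason_with_revision("x + 3 == 10", propose, check)
print(answer, len(trace))
```

The extra inference compute goes into the revision iterations (the `trace`), which is exactly the part o1 keeps hidden from users.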

Hidden Chain-of-Thought and Openness

  • A major controversy is OpenAI’s choice to hide raw chain‑of‑thought, citing “user experience,” safety, and “competitive advantage.”
  • Users see this as:
    • Blocking competitors from using CoT traces as training data.
    • Reducing interpretability and debuggability for developers.
    • Further erosion of the “Open” in OpenAI.

Cost, Compute, and Product Constraints

  • o1‑preview is ~3–4× more expensive per token than GPT‑4o, and the hidden reasoning tokens are billed as output tokens on top of the visible answer.
  • Compute–accuracy graphs use a log-scale compute axis; commenters infer that each step up in quality requires exponentially more test‑time compute.
  • ChatGPT use is heavily rate‑limited (e.g., ~30 messages/week), reinforcing that it’s expensive and slower than prior models.
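Back‑of‑the‑envelope arithmetic makes the billing point above concrete. The prices below are the launch‑era list prices as I recall them ($15/$60 per 1M input/output tokens for o1‑preview vs $5/$15 for GPT‑4o; verify against current pricing), and the token counts are made up for illustration:

```python
def request_cost(in_tok, visible_out_tok, hidden_tok, in_price, out_price):
    """Dollar cost of one request; prices are per 1M tokens.
    Key wrinkle: hidden reasoning tokens are billed at the output rate."""
    billed_out = visible_out_tok + hidden_tok
    return (in_tok * in_price + billed_out * out_price) / 1_000_000

# Same prompt, 1k tokens in / 1k visible tokens out; assume o1 also burns
# 5k hidden reasoning tokens on the way to its answer.
gpt4o = request_cost(1_000, 1_000, 0, 5, 15)
o1 = request_cost(1_000, 1_000, 5_000, 15, 60)
print(f"GPT-4o: ${gpt4o:.3f}  o1-preview: ${o1:.3f}  ratio: {o1/gpt4o:.1f}x")
```

Under these assumptions the per‑request gap (~19×) is far larger than the ~3–4× per‑token price difference, because most of the bill is invisible reasoning tokens.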

Practical Usefulness & Early User Tests

  • Some report impressive results on tricky reverse‑engineering, puzzles, and code tasks that stumped prior models.
  • Others see familiar failures: logic puzzles, ciphers, math derivations, and hallucinated “plausible but wrong” chains of thought.
  • Several note that for many everyday coding and writing tasks, Claude 3.5 Sonnet or GPT‑4o remain similarly useful and much faster.

Impact on Jobs and Coding Practice

  • Strong debate about economic impact:
    • Optimists: this is a force multiplier for good engineers; demand for software and automation will expand.
    • Pessimists: mid/junior coding work may be hollowed out; long‑run pressure on salaries and entry‑level opportunities.
  • Many describe shifting from “writing code” to “specifying, reviewing, and integrating AI‑generated code.”

Safety, Alignment, and Risk Concerns

  • System card excerpts show better offensive‑security and bio‑lab reasoning, plus examples of “reward hacking” and creative exploitation of infrastructure.
  • Some argue hidden CoT is mainly about brand/regulatory safety (e.g., bomb recipes, biosynth protocols) rather than real existential alignment.
  • Broader existential worries appear: AI surpassing humans on narrow tasks, path to AGI, and downstream social/economic disruption.

Skepticism About Hype

  • Multiple commenters emphasize this is a marketing post: missing concrete timing info, cherry‑picked demos, and limited comparison to GPT‑4.
  • View that benchmarks for “reasoning” are increasingly noisy; real test will be sustained performance in open‑ended, real-world workflows.