Learning to Reason with LLMs

Perceived Capabilities and Benchmarks

  • Many commenters see o1 as a noticeable jump in math, contest coding, and formal reasoning vs GPT‑4o, citing Codeforces Elo, AIME/AMC, and IOI‑style results.
  • Others note gains on SWE‑bench and real coding tasks are more modest; not “junior dev replacement” yet.
  • Several worry about overfitting and benchmark gaming (many samples per problem, relaxed submission limits, reranking among 1,000 candidate outputs) rather than robust single‑shot performance.
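The reranking concern above refers to best‑of‑n style evaluation: sample many candidate answers and report the plurality (or reranked) one, which can inflate scores well beyond single‑shot accuracy. A minimal majority‑vote sketch, with a hypothetical stub standing in for the model call and a made‑up answer distribution:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    # Stand-in for one stochastic model sample; a real system would call
    # an LLM at temperature > 0. Hypothetical toy distribution: the
    # "correct" answer appears 50% of the time, distractors split the rest.
    return rng.choices(["42", "41", "43"], weights=[0.5, 0.25, 0.25])[0]

def best_of_n(problem, n, seed=0):
    """Sample n candidates and return the most common answer
    (self-consistency / majority-vote reranking)."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(problem, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

# With n=1000, the plurality answer is reliably the majority mode even
# though any single draw is wrong half the time.
print(best_of_n("toy problem", n=1000))
```

This illustrates why commenters distinguish "pass@1000 with reranking" from the single‑shot performance a developer actually gets per API call.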

How “Reasoning” Seems Implemented

  • Consensus: it’s a scaled‑up chain‑of‑thought / “tree of thoughts” style system using extra inference compute plus RL to learn better thinking strategies.
  • It appears to generate long hidden reasoning traces, self‑check, backtrack, and refine answers, somewhat like an automated multi‑step agent rather than a single‑pass LLM.
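The loop described above can be sketched as a propose/verify/revise cycle. This is a toy illustration of the control flow commenters are describing, not OpenAI's actual method; `propose` and `check` are hypothetical stand‑ins for model calls:

```python
def reason_with_revision(problem, propose, check, max_steps=8):
    """Draft a solution, self-check it, and revise using the critique
    until the check passes or the step budget runs out."""
    trace = []  # hidden chain-of-thought: kept internal, not shown to the user
    attempt = propose(problem, feedback=None)
    for _ in range(max_steps):
        ok, feedback = check(problem, attempt)
        trace.append((attempt, feedback))
        if ok:
            return attempt, trace
        attempt = propose(problem, feedback=feedback)  # backtrack and refine
    return attempt, trace  # budget exhausted; return best effort

# Toy instantiation: "solve" x + 3 == 10 by guessing, then correcting.
def propose(problem, feedback):
    return 0 if feedback is None else feedback  # naive: adopt the hinted value

def check(problem, attempt):
    target = 10 - 3
    return (attempt == target, target)

answer, trace = reason_with_revision("x + 3 == 10", propose, check)
print(answer, len(trace))
```

The extra inference compute goes into the revision iterations (the `trace`), which is exactly the part o1 keeps hidden from users.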

Hidden Chain-of-Thought and Openness

  • A major controversy is OpenAI’s choice to hide raw chain‑of‑thought, citing “user experience,” safety, and “competitive advantage.”
  • Users see this as:
    • Blocking competitors from using CoT traces as training data.
    • Reducing interpretability and debuggability for developers.
    • Further erosion of the “Open” in OpenAI.

Cost, Compute, and Product Constraints

  • o1‑preview is ~3–4× more expensive per token than GPT‑4o, and the hidden reasoning tokens are billed as output tokens on top of the visible answer.
  • Compute–accuracy graphs use a log-scale compute axis; commenters infer that each step up in quality requires exponentially more test‑time compute.
  • ChatGPT use is heavily rate‑limited (e.g., ~30 messages/week), reinforcing that it’s expensive and slower than prior models.
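Back‑of‑the‑envelope arithmetic makes the billing point above concrete. The prices below are the launch‑era list prices as I recall them ($15/$60 per 1M input/output tokens for o1‑preview vs $5/$15 for GPT‑4o; verify against current pricing), and the token counts are made up for illustration:

```python
def request_cost(in_tok, visible_out_tok, hidden_tok, in_price, out_price):
    """Dollar cost of one request; prices are per 1M tokens.
    Key wrinkle: hidden reasoning tokens are billed at the output rate."""
    billed_out = visible_out_tok + hidden_tok
    return (in_tok * in_price + billed_out * out_price) / 1_000_000

# Same prompt, 1k tokens in / 1k visible tokens out; assume o1 also burns
# 5k hidden reasoning tokens on the way to its answer.
gpt4o = request_cost(1_000, 1_000, 0, 5, 15)
o1 = request_cost(1_000, 1_000, 5_000, 15, 60)
print(f"GPT-4o: ${gpt4o:.3f}  o1-preview: ${o1:.3f}  ratio: {o1/gpt4o:.1f}x")
```

Under these assumptions the per‑request gap (~19×) is far larger than the ~3–4× per‑token price difference, because most of the bill is invisible reasoning tokens.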

Practical Usefulness & Early User Tests

  • Some report impressive results on tricky reverse‑engineering, puzzles, and code tasks that stumped prior models.
  • Others see familiar failures: logic puzzles, ciphers, math derivations, and hallucinated “plausible but wrong” chains of thought.
  • Several note that for many everyday coding and writing tasks, Claude 3.5 Sonnet or GPT‑4o remain similarly useful and much faster.

Impact on Jobs and Coding Practice

  • Strong debate about economic impact:
    • Optimists: this is a force multiplier for good engineers; demand for software and automation will expand.
    • Pessimists: mid/junior coding work may be hollowed out; long‑run pressure on salaries and entry‑level opportunities.
  • Many describe shifting from “writing code” to “specifying, reviewing, and integrating AI‑generated code.”

Safety, Alignment, and Risk Concerns

  • System card excerpts show better offensive‑security and bio‑lab reasoning, plus examples of “reward hacking” and creative exploitation of infrastructure.
  • Some argue hidden CoT is mainly about brand/regulatory safety (e.g., bomb recipes, biosynth protocols) rather than real existential alignment.
  • Broader existential worries appear: AI surpassing humans on narrow tasks, path to AGI, and downstream social/economic disruption.

Skepticism About Hype

  • Multiple commenters emphasize this is a marketing post: missing concrete timing info, cherry‑picked demos, and limited comparison to GPT‑4.
  • View that benchmarks for “reasoning” are increasingly noisy; real test will be sustained performance in open‑ended, real-world workflows.