Notes on OpenAI's new o1 chain-of-thought models

Model design and training paradigm

  • Many speculate o1 is essentially a GPT‑4‑class base model wrapped in an agentic loop that iteratively “thinks,” backtracks, and re-prompts itself.
  • Some see this as a generalized version of techniques already used in LangChain/DSPy/agent frameworks; others argue the RL and orchestration here are non‑trivial.
  • There is discussion of o1/o1‑mini being used to generate synthetic chain‑of‑thought data to train “GPT‑5/Orion,” enabling a ladder of self‑bootstrapping models.
  • The hidden reasoning tokens are believed to be partly about preventing competitors from scraping CoT traces and partly about enabling unaligned internal reasoning that gets post‑filtered.
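The speculated agentic loop above (propose, self-critique, backtrack, re-prompt) can be sketched generically. Everything here is a hypothetical illustration of the pattern, not OpenAI's actual mechanism; `reasoning_loop`, the critic interface, and the stop condition are all assumptions.

```python
from typing import Callable

def reasoning_loop(
    task: str,
    model: Callable[[str], str],         # proposes an answer given a prompt
    critic: Callable[[str, str], bool],  # accepts or rejects an answer
    max_steps: int = 5,
) -> str:
    """Iteratively re-prompt a model: propose, self-critique, and on
    failure fold the rejected attempt back into the context and retry.
    The rejected attempts play the role of hidden 'reasoning' tokens."""
    scratchpad = task
    answer = ""
    for _ in range(max_steps):
        answer = model(scratchpad)
        if critic(task, answer):
            return answer
        # Backtrack: keep the failed attempt in context, try again.
        scratchpad += f"\nPrevious attempt (rejected): {answer}\nTry another approach."
    return answer

# Toy demo with a stub "model" that guesses and a critic that checks:
if __name__ == "__main__":
    guesses = iter(["7", "10", "12"])
    model = lambda prompt: next(guesses)
    critic = lambda task, ans: int(ans) % 4 == 0
    print(reasoning_loop("Find a multiple of 4", model, critic))  # prints 12
```

The point of the sketch is that the wrapper is model-agnostic, which is why some commenters see o1 as a productized version of what LangChain/DSPy users already build by hand.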

Capabilities, reasoning, and hallucinations

  • Users report clear gains on multi‑step math, programming competitions, logic puzzles, and some code refactoring tasks; o1‑mini often shines on symbolic problems with less world knowledge.
  • At the same time, many examples show o1‑preview still hallucinating APIs, legal rules, game logic, chess facts, and basic math/logic, sometimes with long but wrong chains of thought.
  • Some say it’s meaningfully better than GPT‑4o on “hard” tasks if you keep pushing it; others find minimal real‑world improvement relative to the extra cost and latency.
  • Several note it’s worse than earlier models at obscure factual recall; it remains “a lossy compressed database plus pattern matcher,” not a reliable encyclopedia.

Developer experience, tooling, and pricing

  • Hidden reasoning tokens worry developers: harder to debug where reasoning went wrong, and users pay for tokens they can’t see or verify.
  • Some accept this as long as average call cost is budget‑friendly; others see risk of opaque, unbounded price changes and call it anti‑competitive.
  • Lack of tool use, system prompts, streaming, and multimodality (for now) is attributed to beta status; many expect future o1 variants with tools, RAG, code execution, and multimodal CoT.
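The billing worry is easy to make concrete: hidden reasoning tokens are billed at the output rate even though the caller never sees them, so a short visible answer can carry a large invisible cost. A back-of-envelope estimator (the per-token rates below are illustrative placeholders, not current OpenAI pricing):

```python
def call_cost_usd(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_rate: float = 15.0,   # USD per 1M input tokens (illustrative rate)
    output_rate: float = 60.0,  # USD per 1M output tokens (illustrative rate)
) -> float:
    """Hidden reasoning tokens are charged as output tokens,
    so they are added to the visible completion before pricing."""
    billable_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_rate + billable_output * output_rate) / 1_000_000

# A 200-token visible answer with an 8,000-token hidden trace:
# reasoning accounts for ~$0.49 of the ~$0.51 total.
print(round(call_cost_usd(1_000, 200, 8_000), 4))
```

This is what makes the "tokens you can't see or verify" complaint bite: the dominant cost term is precisely the part of the transcript the developer cannot audit.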

Impact on programming work

  • Several engineers say current LLMs (Claude, GPT‑4o, Cursor, etc.) already generate most of their day‑to‑day code; o1 is viewed as another step, not a revolution.
  • Others report LLMs failing on the exact hard problems where they need help, or requiring so much babysitting that any productivity gain disappears.
  • There’s debate on whether agentic, reasoning‑heavy models will make a single engineer as productive as “five,” or just flood the world with more brittle, barely‑understood software.

Benchmarks, hype, and terminology

  • Strong scores on AIME/GPQA and coding benchmarks are contrasted with users saying “I can’t feel the difference” in normal workflows.
  • Some see safety talk and “PhD‑level reasoning” claims as marketing that overstates real capabilities; others note incremental but genuine gains.
  • Multiple commenters object to terms like “reasoning,” “thought,” and “intelligence” for LLM behavior; others argue that, at least informally, they fit the observable capabilities.