Notes on OpenAI's new o1 chain-of-thought models
Model design and training paradigm
- Many speculate o1 is essentially a GPT‑4‑class base model wrapped in an agentic loop that iteratively “thinks,” backtracks, and re-prompts itself (a minimal sketch of such a loop follows this list).
- Some see this as a generalized version of techniques already used in LangChain/DSPy/agent frameworks; others argue the RL and orchestration here are non‑trivial.
- There is discussion of o1/o1‑mini being used to generate synthetic chain‑of‑thought data to train “GPT‑5/Orion,” enabling a ladder of self‑bootstrapping models.
- The hidden reasoning tokens are believed to be partly about preventing competitors from scraping CoT traces and partly about enabling unaligned internal reasoning that gets post‑filtered.
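Purely to illustrate the loop commenters are guessing at, here is a minimal Python sketch of a draft / self-critique / retry cycle over the standard chat-completions API. The model name, prompts, and stopping rule are all assumptions for the example; nothing here reflects how o1 actually works internally.

```python
# Speculative sketch of the "agentic loop" idea: draft, self-critique, retry.
# This is NOT OpenAI's actual o1 pipeline; it only illustrates the pattern
# commenters describe, using an ordinary chat model (gpt-4o here).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def solve_with_self_check(problem: str, max_rounds: int = 3) -> str:
    # First pass: produce a step-by-step draft.
    draft = ask([{"role": "user",
                  "content": f"Think step by step, then answer:\n{problem}"}])
    for _ in range(max_rounds):
        # Ask the model to critique its own draft.
        critique = ask([{"role": "user",
                         "content": f"Problem:\n{problem}\n\nDraft answer:\n{draft}\n\n"
                                    "List any errors. Reply exactly OK if the draft is correct."}])
        if critique.strip() == "OK":
            break
        # Revise the draft using the critique (the "backtrack and re-prompt" step).
        draft = ask([{"role": "user",
                      "content": f"Problem:\n{problem}\n\nPrevious draft:\n{draft}\n\n"
                                 f"Critique:\n{critique}\n\nWrite a corrected answer."}])
    return draft

if __name__ == "__main__":
    print(solve_with_self_check(
        "A train leaves at 3:40 pm and the trip takes 95 minutes. When does it arrive?"))
```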
Capabilities, reasoning, and hallucinations
- Users report clear gains on multi‑step math, programming competitions, logic puzzles, and some code refactoring tasks; o1‑mini often shines on symbolic problems despite having less world knowledge.
- At the same time, many examples show o1‑preview still hallucinating APIs, legal rules, game logic, chess facts, and basic math/logic, sometimes producing long but wrong chains of thought.
- Some say it’s meaningfully better than GPT‑4o in “hard” tasks if you keep pushing it; others find minimal real‑world improvement relative to extra cost and latency.
- Several note it’s worse than earlier models at obscure factual recall; it remains “a lossy compressed database plus pattern matcher,” not a reliable encyclopedia.
Developer experience, tooling, and pricing
- Hidden reasoning tokens worry developers: it is harder to debug where reasoning went wrong, and users pay for tokens they cannot see or verify (see the token-accounting sketch after this list).
- Some accept this as long as the average call cost stays affordable; others see a risk of opaque, unbounded costs and call the hidden accounting anti‑competitive.
- The lack of tool use, system prompts, streaming, and multimodality (for now) is attributed to beta status; many expect future o1 variants with tools, RAG, code execution, and multimodal CoT.
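The sketch below shows why the billing concern is concrete: the response body contains only the final answer, while the usage accounting (per the o1 launch notes) includes reasoning tokens that are charged but never returned. The field names reflect the API at o1‑preview's launch and are an assumption that may change.

```python
# Sketch of hidden-reasoning-token accounting with o1-preview.
# You are billed for reasoning tokens that never appear in the output.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

resp = client.chat.completions.create(
    model="o1-preview",
    # At launch, o1 models rejected system messages, tools, and streaming,
    # so this call uses a single plain user message.
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)

usage = resp.usage
# Defensive access: the reasoning-token breakdown is a launch-era field name
# and may be renamed or absent in later API versions.
details = getattr(usage, "completion_tokens_details", None)
hidden = getattr(details, "reasoning_tokens", None) if details else None

print("visible answer:", resp.choices[0].message.content)
print("completion tokens billed:", usage.completion_tokens)
print("of which hidden reasoning tokens:", hidden)  # billed, but absent from the output
```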
Impact on programming work
- Several engineers say current LLM tooling (Claude, GPT‑4o, Cursor, etc.) already generates most of their day‑to‑day code; o1 is viewed as another step, not a revolution.
- Others report LLMs failing on the exact hard problems where they need help, or requiring so much babysitting that any productivity gain disappears.
- There’s debate over whether agentic, reasoning‑heavy models will make a single engineer as productive as “five,” or simply flood the world with more brittle, barely‑understood software.
Benchmarks, hype, and terminology
- Strong scores on AIME/GPQA and coding benchmarks are contrasted with users saying “I can’t feel the difference” in normal workflows.
- Some see safety talk and “PhD‑level reasoning” claims as marketing that overstates real capabilities; others note incremental but genuine gains.
- Multiple commenters object to terms like “reasoning,” “thought,” and “intelligence” for LLM behavior; others argue that, at least informally, they fit the observable capabilities.