Notes on OpenAI's new o1 chain-of-thought models

Model design and training paradigm

  • Many speculate o1 is essentially a GPT‑4‑class base model wrapped in an agentic loop that iteratively “thinks,” backtracks, and re-prompts itself.
  • Some see this as a generalized version of techniques already used in LangChain/DSPy/agent frameworks; others argue the RL and orchestration here are non‑trivial.
  • There is discussion of o1/o1‑mini being used to generate synthetic chain‑of‑thought data to train “GPT‑5/Orion,” enabling a ladder of self‑bootstrapping models.
  • The hidden reasoning tokens are believed to be partly about preventing competitors from scraping CoT traces and partly about enabling unaligned internal reasoning that gets post‑filtered.
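The speculated agentic loop above (propose, self-critique, backtrack, re-prompt) can be sketched generically. Everything here is a hypothetical illustration of the pattern, not OpenAI's actual mechanism; `reasoning_loop`, the critic interface, and the stop condition are all assumptions.

```python
from typing import Callable

def reasoning_loop(
    task: str,
    model: Callable[[str], str],         # proposes an answer given a prompt
    critic: Callable[[str, str], bool],  # accepts or rejects an answer
    max_steps: int = 5,
) -> str:
    """Iteratively re-prompt a model: propose, self-critique, and on
    failure fold the rejected attempt back into the context and retry.
    The rejected attempts play the role of hidden 'reasoning' tokens."""
    scratchpad = task
    answer = ""
    for _ in range(max_steps):
        answer = model(scratchpad)
        if critic(task, answer):
            return answer
        # Backtrack: keep the failed attempt in context, try again.
        scratchpad += f"\nPrevious attempt (rejected): {answer}\nTry another approach."
    return answer

# Toy demo with a stub "model" that guesses and a critic that checks:
if __name__ == "__main__":
    guesses = iter(["7", "10", "12"])
    model = lambda prompt: next(guesses)
    critic = lambda task, ans: int(ans) % 4 == 0
    print(reasoning_loop("Find a multiple of 4", model, critic))  # prints 12
```

The point of the sketch is that the wrapper is model-agnostic, which is why some commenters see o1 as a productized version of what LangChain/DSPy users already build by hand.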

Capabilities, reasoning, and hallucinations

  • Users report clear gains on multi‑step math, programming competitions, logic puzzles, and some code refactoring tasks; o1‑mini often shines on symbolic problems with less world knowledge.
  • At the same time, many examples show o1‑preview still hallucinating APIs, legal rules, game logic, chess facts, and basic math/logic, sometimes with long but wrong chains of thought.
  • Some say it’s meaningfully better than GPT‑4o on “hard” tasks if you keep pushing it; others find minimal real‑world improvement relative to the extra cost and latency.
  • Several note it’s worse than earlier models at obscure factual recall; it remains “a lossy compressed database plus pattern matcher,” not a reliable encyclopedia.

Developer experience, tooling, and pricing

  • Hidden reasoning tokens worry developers: harder to debug where reasoning went wrong, and users pay for tokens they can’t see or verify.
  • Some accept this as long as average call cost is budget‑friendly; others see risk of opaque, unbounded price changes and call it anti‑competitive.
  • Lack of tool use, system prompts, streaming, and multimodality (for now) is attributed to beta status; many expect future o1 variants with tools, RAG, code execution, and multimodal CoT.
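The billing worry is easy to make concrete: hidden reasoning tokens are billed at the output rate even though the caller never sees them, so a short visible answer can carry a large invisible cost. A back-of-envelope estimator (the per-token rates below are illustrative placeholders, not current OpenAI pricing):

```python
def call_cost_usd(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_rate: float = 15.0,   # USD per 1M input tokens (illustrative rate)
    output_rate: float = 60.0,  # USD per 1M output tokens (illustrative rate)
) -> float:
    """Hidden reasoning tokens are charged as output tokens,
    so they are added to the visible completion before pricing."""
    billable_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_rate + billable_output * output_rate) / 1_000_000

# A 200-token visible answer with an 8,000-token hidden trace:
# reasoning accounts for ~$0.49 of the ~$0.51 total.
print(round(call_cost_usd(1_000, 200, 8_000), 4))
```

This is what makes the "tokens you can't see or verify" complaint bite: the dominant cost term is precisely the part of the transcript the developer cannot audit.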

Impact on programming work

  • Several engineers say current LLMs (Claude, GPT‑4o, Cursor, etc.) already generate most of their day‑to‑day code; o1 is viewed as another step, not a revolution.
  • Others report LLMs failing on the exact hard problems where they need help, or requiring so much babysitting that any productivity gain disappears.
  • There’s debate on whether agentic, reasoning‑heavy models will make a single engineer as productive as “five,” or just flood the world with more brittle, barely‑understood software.

Benchmarks, hype, and terminology

  • Strong scores on AIME/GPQA and coding benchmarks are contrasted with users saying “I can’t feel the difference” in normal workflows.
  • Some see safety talk and “PhD‑level reasoning” claims as marketing that overstates real capabilities; others note incremental but genuine gains.
  • Multiple commenters object to terms like “reasoning,” “thought,” and “intelligence” for LLM behavior; others argue that, at least informally, they fit the observable capabilities.