OpenAI Progress

Perceived progress and what’s missing from the chart

  • Many note conspicuous omissions: no original GPT‑3/ChatGPT, no GPT‑4o, o1, or o3. Several speculate this makes the jump from early GPT‑4 straight to GPT‑5 look larger than it feels in practice.
  • Different people peg the “biggest leap” at different places: GPT‑1→2 (first time it felt qualitatively new), 3→3.5 (first usable “ChatGPT”), 3.5→4 (from toy to broadly useful), or 4→o1 (reasoning/math paradigm shift). GPT‑5 and o3 are widely described as incremental.

Creative writing and “personality”

  • A strong contingent prefers GPT‑1/2 and text‑davinci‑001 for stories and poems: shorter, weirder, more evocative, less “corporate.” Newer models are called polished but bland.
  • The 50‑word sentient toaster story splits readers: some think GPT‑5’s version is structurally better and follows instructions; others find it sterile next to davinci’s incomplete but atmospheric output.
  • Similar reactions to limericks: later models are better at formal constraints but less surprising. Several blame RLHF and safety tuning for sanding off creativity.
  • Others say GPT‑4.1 or GPT‑4.5 is currently the best creative writer, and note that adding explicit style constraints (“make it evocative and weird”) still produces striking text.

Usefulness, coding, and reasoning

  • Many report GPT‑5 as a regression for coding: more unnecessary edits, odd mistakes with language APIs, trouble handling long Markdown or regex, and weaker coherence than 4o or Claude in real workflows.
  • Others find GPT‑5 a big win for tool use, predictability, and “doing exactly what you asked,” especially with structured outputs (see the sketch after this list).
  • Several users highlight o1/o3 as the real jump for math/physics and “one‑shot” app building; 4o’s main value was speed, multimodality, and voice.
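
  What commenters mean by structured outputs is that the model’s reply is constrained to a caller‑supplied JSON schema. A minimal sketch, assuming the official OpenAI Python SDK and its JSON‑schema response_format; the model id and the schema itself are illustrative assumptions, not details from the thread:

    # Sketch of JSON-schema-constrained ("structured") outputs with the openai SDK.
    # The model id and schema are illustrative assumptions, not from the thread.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-5",  # assumed model id for illustration
        messages=[{"role": "user", "content": "Which GPT generation was the biggest leap, and why?"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "model_leap",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "from_model": {"type": "string"},
                        "to_model": {"type": "string"},
                        "reason": {"type": "string"},
                    },
                    "required": ["from_model", "to_model", "reason"],
                    "additionalProperties": False,
                },
            },
        },
    )

    # With strict mode, the message content is valid JSON matching the schema.
    print(response.choices[0].message.content)

  Because strict mode guarantees the reply parses against the schema, downstream tools can consume it without defensive parsing, which is the “predictability” commenters are praising.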

Facts, search, and reliability

  • There’s a long argument over using LLMs for fact checking. One side claims GPT‑5 “thinking” mode is highly accurate on non‑niche topics; others present coding and conceptual errors and warn against Gell‑Mann‑amnesia‑style overtrust.
  • Many like LLMs as search front‑ends: clarifying vague queries, surfacing “unknown unknowns,” and cutting through SEO junk. But citations often 404 or don’t support the claims, and some users find this wastes time.
  • Several insist LLM answers must be treated like any other low‑reliability source: useful for ideas and links, not as a source of record.

Style, tone, and sycophancy

  • GPT‑5 is described as more “glazing,” flattering, and conversational, less likely to say “as an AI model…,” and more willing to answer authoritatively (e.g., tax questions) with only a soft suggestion to consult a professional.
  • Some find this creepy or dangerous, preferring earlier explicit disclaimers; others say it’s just UX evolution and can be adjusted via personality settings.
  • Verbosity is a common complaint: GPT‑4/5 answers are seen as overlong compared to davinci‑001’s concise completions.

Benchmarks vs lived experience and plateau debate

  • One side cites LMSYS, LiveBench, IQ‑style tests, IMO gold, and “vibe coding” as evidence that the state of the art has advanced dramatically even in the last year.
  • Skeptics counter that benchmarks are easily gamed and that everyday experience—especially with GPT‑5’s release—feels like stagnation or regression, with more PR gloss than clear qualitative gain.
  • A meta‑thread discusses Amara’s law and “threshold” effects: early jumps from useless→OK feel huge, later OK→better changes feel small. Some argue we’re nearing a transformer plateau; others think progress is still rapid but increasingly hard for non‑experts to perceive.