GPT-5 is behind schedule

Model quality and perceived progress

  • Some participants say o1-Pro and o3 feel like major leaps over GPT‑4, especially on reasoning-heavy tasks; others report little or no improvement over last year’s models for everyday questions.
  • Several feel early ChatGPT was better, and are unsure whether that reflects real regression or rose‑tinted memory.
  • There’s disagreement over which vendor leads: many praise Claude 3.5 Sonnet for coding and general use; others prefer GPT‑4o or o1 for math and logic. Gemini 2.0 gets mixed reviews.

Coding and developer workflows

  • Experiences are sharply split:
    • Positive: great for boilerplate, small well-specified functions, porting between frameworks, refactoring messy notebooks into cleaner code, writing DSL parsers, protocol implementations, or exploring unfamiliar languages.
    • Negative: frequent hallucinated APIs, subtle bugs (off‑by‑one, race conditions), inability to handle complex, proprietary codebases, and loops of wrong fixes.
  • Best results come when:
    • Breaking problems into standard subproblems.
    • Treating the model like a junior colleague and iterating based on review.
    • Using RAG/projects with full repo context.
  • The gap between benchmarks and real-world performance is debated: some see o1/o3 excelling on tests but underperforming in day‑to‑day coding relative to GPT‑4 or Claude.

Use cases and productivity

  • Strong enthusiasm for:
    • Fast learning and tutoring conversations.
    • Text work: summarization, translation, grammar, and proposal/report drafting.
    • Niche productivity wins (e.g., interpreting medical tests, product ideation, protocol reverse‑engineering, math-heavy research support).
  • Some users report no meaningful value despite repeated attempts.

Reliability, hallucinations, and confidence

  • Many examples of models being confidently wrong, inventing references, and then “gaslighting” when challenged.
  • Desire for models that:
    • Ask clarifying questions by default.
    • Expose a trustworthy “I don’t know” or uncertainty signal.
  • Some argue constant skepticism and verification are essential; others worry about long‑term skill atrophy and “black box” codebases.

Scaling, data, and synthetic training

  • The thread notes that:
    • Public text data is near-exhausted; higher‑quality, non‑public or synthetic data is now the bottleneck.
    • o‑series (o1/o3) are seen as attempts to turn compute into better training data via reasoning traces.
  • Concerns:
    • Massive training/inference costs (especially for o3‑high) vs. modest benchmark gains.
    • Risks of compounding bias and overfitting when training on synthetic data generated by earlier models.
  • Others counter that:
    • Inference‑time scaling (longer “thinking”, MCTS‑style methods) is a genuine new lever.
    • Synthetic data works in domains with clear correctness checks (math, coding).
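The "synthetic data with correctness checks" point can be sketched as rejection sampling: generate candidate problem/solution pairs and keep only those a verifier confirms. The generator and verifier here are toy arithmetic stand-ins for a model and a checker:

```python
import random

def generate_candidate(rng):
    """Hypothetical 'model': propose a problem and a (possibly wrong) answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    answer = a + b + rng.choice([0, 0, 0, 1])  # occasionally off by one
    return f"{a} + {b}", answer

def verify(problem, answer):
    """Ground-truth check: cheap and exact in domains like arithmetic."""
    a, _, b = problem.split()
    return int(a) + int(b) == answer

def make_synthetic_set(n, seed=0):
    """Keep only verified pairs, discarding incorrect generations."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        problem, answer = generate_candidate(rng)
        if verify(problem, answer):
            kept.append((problem, answer))
    return kept
```

The same loop works wherever a cheap, exact checker exists (unit tests for code, proof checkers for math); without one, errors from the generator pass through unfiltered, which is exactly the compounding-bias concern raised above.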

Economics, AGI, and expectations

  • Some think “more is more” scaling is hitting diminishing returns; GPT‑5’s delay is taken as evidence.
  • Others see steady benchmark progress and new reasoning models as signs that slowdown talk is premature.
  • Debate over:
    • Whether LLMs are a path to AGI vs. just powerful “fancy autocomplete”.
    • Whether a small temporal lead in AGI would confer decisive, even existential, advantages.
    • Sustainability of current valuations given huge burn and unclear high‑margin use cases.

Data ownership and web scraping

  • Growing resistance to AI training:
    • A notable fraction of top sites reportedly block AI crawlers.
    • Fears that long‑tail, high‑quality content will be withheld unless creators are compensated.
  • The legal question of whether training on scraped content is fair use remains unresolved and is seen as potentially pivotal for future model progress.

Agents, integration, and robotics

  • Many expect the next big gains from:
    • Better orchestration: projects, tools, RAG, “panel of experts” agents, and workflow‑aware assistants.
    • Domain‑specific systems trained on internal code, docs, and standards.
  • Robotics is discussed as a longer‑term frontier: LLMs + vision models are transforming research, but commercial, safe home robots are still seen as distant.
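The "panel of experts" orchestration idea can be sketched as a router that dispatches each task to a specialist handler. The experts and keyword rules here are purely illustrative; a real system would route with a classifier or a model call rather than keyword matching:

```python
def code_expert(task):
    return f"[code] handling: {task}"

def math_expert(task):
    return f"[math] handling: {task}"

def general_expert(task):
    return f"[general] handling: {task}"

# Routing table: first matching keyword set wins.
ROUTES = [
    (("refactor", "bug", "function", "compile"), code_expert),
    (("prove", "integral", "equation", "sum"), math_expert),
]

def route(task):
    """Dispatch to the first expert whose keywords match, else fall back."""
    words = set(task.lower().split())
    for keywords, expert in ROUTES:
        if words & set(keywords):
            return expert(task)
    return general_expert(task)
```

The same dispatch shape underlies workflow-aware assistants: the router narrows the context and tools each specialist sees, instead of giving one model everything.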