GPT-5 is behind schedule
Model quality and perceived progress
- Some participants say o1-Pro and o3 feel like major leaps over GPT‑4, especially on reasoning-heavy tasks; others report little or no improvement over last year’s models for everyday questions.
- Several feel early ChatGPT was better, and suspect either genuine regression or rose‑tinted memory.
- There’s disagreement over which vendor leads: many praise Claude 3.5 Sonnet for coding and general use; others prefer GPT‑4o or o1 for math and logic. Gemini 2.0 gets mixed reviews.
Coding and developer workflows
- Experiences are sharply split:
  - Positive: great for boilerplate, small well-specified functions, porting between frameworks, refactoring messy notebooks into cleaner code, writing DSL parsers, protocol implementations, or exploring unfamiliar languages.
  - Negative: frequent hallucinated APIs, subtle bugs (off‑by‑one, race conditions), inability to handle complex, proprietary codebases, and loops of wrong fixes.
- Best results come when:
  - Breaking problems into standard subproblems.
  - Treating the model like a junior colleague and iterating based on review.
  - Using RAG/projects with full repo context.
- The gap between benchmarks and reality is debated; some see o1/o3 excelling on tests yet underperforming GPT‑4 or Claude in day‑to‑day coding.
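The "full repo context" workflow above can be sketched with a toy retrieval step. Everything here is hypothetical and simplified: real RAG setups typically use embeddings rather than keyword overlap, and `repo_files` stands in for whatever file-loading mechanism a project uses.

```python
import re

def tokenize(text):
    # Lowercase and split on non-letters so "parse_config" matches "parse".
    return re.findall(r"[a-z]+", text.lower())

def rank_files(question, repo_files, top_k=3):
    """Score each file by keyword overlap with the question and return
    the top_k most relevant (path, score) pairs. repo_files is a
    {path: source} dict."""
    q_tokens = set(tokenize(question))
    scored = [
        (path, len(q_tokens & set(tokenize(source))))
        for path, source in repo_files.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def build_prompt(question, repo_files, top_k=3):
    """Concatenate the highest-ranked files into the prompt so the model
    sees real project code instead of guessing at APIs."""
    context = "\n\n".join(
        f"# {path}\n{repo_files[path]}"
        for path, _ in rank_files(question, repo_files, top_k)
    )
    return f"{context}\n\nQuestion: {question}"
```

The point of the retrieval step is the one the thread makes: grounding the model in actual repo contents is what reduces hallucinated APIs, not the prompt template itself.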
Use cases and productivity
- Strong enthusiasm for:
  - Fast learning and tutoring conversations.
  - Text work: summarization, translation, grammar, and proposal/report drafting.
  - Niche productivity wins (e.g., interpreting medical tests, product ideation, protocol reverse‑engineering, math-heavy research support).
- Some users report no meaningful value despite repeated attempts.
Reliability, hallucinations, and confidence
- Many examples of models being confidently wrong, inventing references, and then “gaslighting” when challenged.
- Desire for models that:
  - Ask clarifying questions by default.
  - Expose a trustworthy “I don’t know” or uncertainty signal.
- Some argue constant skepticism and verification are essential; others worry about long‑term skill atrophy and “black box” codebases.
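One common way to approximate the uncertainty signal the thread asks for is self-consistency: sample the model several times and abstain when no clear majority emerges. This is a minimal sketch, assuming a hypothetical `ask_model` callable that returns one sampled answer per call.

```python
from collections import Counter

def answer_with_uncertainty(ask_model, question, n_samples=5, min_agreement=0.6):
    """Sample the model n_samples times and return (answer, agreement).
    If the most common answer falls below min_agreement, return None as
    the answer, which callers should surface as "I don't know"."""
    samples = [ask_model(question) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / n_samples
    if agreement < min_agreement:
        return None, agreement
    return answer, agreement
```

Agreement across samples is only a proxy for correctness: a model can be confidently and consistently wrong, which is exactly the failure mode described above, so this complements rather than replaces verification.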
Scaling, data, and synthetic training
- Thread notes:
  - Public text data is near-exhausted; higher‑quality, non‑public or synthetic data is now the bottleneck.
  - o‑series (o1/o3) are seen as attempts to turn compute into better training data via reasoning traces.
- Concerns:
  - Massive training/inference costs (especially for o3‑high) vs. modest benchmark gains.
  - Risks of compounding bias and overfitting when training on synthetic data generated by earlier models.
- Others counter that:
  - Inference‑time scaling (longer “thinking”, MCTS‑style methods) is a genuine new lever.
  - Synthetic data works in domains with clear correctness checks (math, coding).
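The "clear correctness checks" argument can be illustrated with a toy filtering loop: generated coding samples are executed against tests, and only verifiably correct ones enter the training set. The pipeline shape here is an assumption for illustration; real systems run candidates in a sandbox, not a bare `exec`.

```python
def passes_check(candidate_code, tests):
    """Run one generated code sample against a list of test callables.
    Each test receives the namespace produced by executing the sample.
    Any failure or exception disqualifies the candidate."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # toy stand-in for sandboxed execution
        return all(test(namespace) for test in tests)
    except Exception:
        return False

def filter_synthetic(candidates, tests):
    """Keep only candidates that pass every correctness check, which is
    what limits the bias-compounding risk of training on model output."""
    return [c for c in candidates if passes_check(c, tests)]
```

This is why math and coding are the favorable domains: the verifier is cheap and objective, whereas prose quality has no equivalent automated check.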
Economics, AGI, and expectations
- Some think “more is more” scaling is hitting diminishing returns; GPT‑5’s delay is taken as evidence.
- Others see steady benchmark progress and new reasoning models as signs that slowdown talk is premature.
- Debate over:
  - Whether LLMs are a path to AGI vs. just powerful “fancy autocomplete”.
  - Whether a small temporal lead in AGI would confer decisive, even existential, advantages.
  - Sustainability of current valuations given huge burn and unclear high‑margin use cases.
Data ownership and web scraping
- Growing resistance to AI training:
  - A notable fraction of top sites reportedly block AI crawlers.
  - Fears that long‑tail, high‑quality content will be withheld unless creators are compensated.
- The legal question of whether training on scraped content is fair use remains unresolved and is seen as potentially pivotal for future model advances.
Agents, integration, and robotics
- Many expect the next big gains from:
  - Better orchestration: projects, tools, RAG, “panel of experts” agents, and workflow‑aware assistants.
  - Domain‑specific systems trained on internal code, docs, and standards.
- Robotics is discussed as a longer‑term frontier: LLMs + vision models are transforming research, but commercial, safe home robots are still seen as distant.
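The "panel of experts" orchestration idea above can be sketched as a simple router: each expert is a keyword set plus a handler, and queries fall back to a generalist when nothing matches. All names and the keyword-matching heuristic are hypothetical; production routers typically use a classifier or the LLM itself to dispatch.

```python
def route(question, experts):
    """Pick the expert whose keyword set best overlaps the question.
    experts maps name -> (keywords, handler) and must include a
    'generalist' entry used as the fallback."""
    words = set(question.lower().split())
    best_name, best_score = "generalist", 0
    for name, (keywords, _handler) in experts.items():
        score = len(words & keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def answer(question, experts):
    """Dispatch the question to the routed expert's handler."""
    _keywords, handler = experts[route(question, experts)]
    return handler(question)
```

The handlers here would be the domain-specific systems the thread describes, each grounded in its own internal code, docs, or standards.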