GPT-5 is behind schedule
Model quality and perceived progress
- Some participants say o1-Pro and o3 feel like major leaps over GPT‑4, especially on reasoning-heavy tasks; others report little or no improvement over last year’s models for everyday questions.
- Several feel early ChatGPT was better, and suspect either genuine regression or rose‑tinted memory.
- There’s disagreement over which vendor leads: many praise Claude 3.5 Sonnet for coding and general use; others prefer GPT‑4o or o1 for math and logic. Gemini 2.0 gets mixed reviews.
Coding and developer workflows
- Experiences are sharply split:
  - Positive: great for boilerplate, small well-specified functions, porting between frameworks, refactoring messy notebooks into cleaner code, writing DSL parsers, protocol implementations, or exploring unfamiliar languages.
  - Negative: frequent hallucinated APIs, subtle bugs (off‑by‑one, race conditions), inability to handle complex, proprietary codebases, and loops of wrong fixes.
- Best results come when:
  - Breaking problems into standard subproblems.
  - Treating the model like a junior colleague and iterating based on review.
  - Using RAG/projects with full repo context.
- The gap between benchmarks and reality is debated; some see o1/o3 excelling on tests yet underperforming GPT‑4 or Claude in day‑to‑day coding.
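The "full repo context" workflow above can be sketched with a toy retrieval step. Everything here is hypothetical and simplified: real RAG setups typically use embeddings rather than keyword overlap, and `repo_files` stands in for whatever file-loading mechanism a project uses.

```python
import re

def tokenize(text):
    # Lowercase and split on non-letters so "parse_config" matches "parse".
    return re.findall(r"[a-z]+", text.lower())

def rank_files(question, repo_files, top_k=3):
    """Score each file by keyword overlap with the question and return
    the top_k most relevant (path, score) pairs. repo_files is a
    {path: source} dict."""
    q_tokens = set(tokenize(question))
    scored = [
        (path, len(q_tokens & set(tokenize(source))))
        for path, source in repo_files.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def build_prompt(question, repo_files, top_k=3):
    """Concatenate the highest-ranked files into the prompt so the model
    sees real project code instead of guessing at APIs."""
    context = "\n\n".join(
        f"# {path}\n{repo_files[path]}"
        for path, _ in rank_files(question, repo_files, top_k)
    )
    return f"{context}\n\nQuestion: {question}"
```

The point of the retrieval step is the one the thread makes: grounding the model in actual repo contents is what reduces hallucinated APIs, not the prompt template itself.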
Use cases and productivity
- Strong enthusiasm for:
  - Fast learning and tutoring conversations.
  - Text work: summarization, translation, grammar, and proposal/report drafting.
  - Niche productivity wins (e.g., interpreting medical tests, product ideation, protocol reverse‑engineering, math-heavy research support).
- Some users report no meaningful value despite repeated attempts.
Reliability, hallucinations, and confidence
- Many examples of models being confidently wrong, inventing references, and then “gaslighting” when challenged.
- Desire for models that:
  - Ask clarifying questions by default.
  - Expose a trustworthy “I don’t know” or uncertainty signal.
- Some argue constant skepticism and verification are essential; others worry about long‑term skill atrophy and “black box” codebases.
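One common way to approximate the uncertainty signal the thread asks for is self-consistency: sample the model several times and abstain when no clear majority emerges. This is a minimal sketch, assuming a hypothetical `ask_model` callable that returns one sampled answer per call.

```python
from collections import Counter

def answer_with_uncertainty(ask_model, question, n_samples=5, min_agreement=0.6):
    """Sample the model n_samples times and return (answer, agreement).
    If the most common answer falls below min_agreement, return None as
    the answer, which callers should surface as "I don't know"."""
    samples = [ask_model(question) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / n_samples
    if agreement < min_agreement:
        return None, agreement
    return answer, agreement
```

Agreement across samples is only a proxy for correctness: a model can be confidently and consistently wrong, which is exactly the failure mode described above, so this complements rather than replaces verification.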
Scaling, data, and synthetic training
- Thread notes:
  - Public text data is near-exhausted; higher‑quality, non‑public or synthetic data is now the bottleneck.
  - o‑series (o1/o3) are seen as attempts to turn compute into better training data via reasoning traces.
- Concerns:
  - Massive training/inference costs (especially for o3‑high) vs. modest benchmark gains.
  - Risks of compounding bias and overfitting when training on synthetic data generated by earlier models.
- Others counter that:
  - Inference‑time scaling (longer “thinking”, MCTS‑style methods) is a genuine new lever.
  - Synthetic data works in domains with clear correctness checks (math, coding).
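The "clear correctness checks" argument can be illustrated with a toy filtering loop: generated coding samples are executed against tests, and only verifiably correct ones enter the training set. The pipeline shape here is an assumption for illustration; real systems run candidates in a sandbox, not a bare `exec`.

```python
def passes_check(candidate_code, tests):
    """Run one generated code sample against a list of test callables.
    Each test receives the namespace produced by executing the sample.
    Any failure or exception disqualifies the candidate."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # toy stand-in for sandboxed execution
        return all(test(namespace) for test in tests)
    except Exception:
        return False

def filter_synthetic(candidates, tests):
    """Keep only candidates that pass every correctness check, which is
    what limits the bias-compounding risk of training on model output."""
    return [c for c in candidates if passes_check(c, tests)]
```

This is why math and coding are the favorable domains: the verifier is cheap and objective, whereas prose quality has no equivalent automated check.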
Economics, AGI, and expectations
- Some think “more is more” scaling is hitting diminishing returns; GPT‑5’s delay is taken as evidence.
- Others see steady benchmark progress and new reasoning models as signs that slowdown talk is premature.
- Debate over:
  - Whether LLMs are a path to AGI vs. just powerful “fancy autocomplete”.
  - Whether a small temporal lead in AGI would confer decisive, even existential, advantages.
  - Sustainability of current valuations given huge burn and unclear high‑margin use cases.
Data ownership and web scraping
- Growing resistance to AI training:
  - A notable fraction of top sites reportedly block AI crawlers.
  - Fears that long‑tail, high‑quality content will be withheld unless creators are compensated.
- The legal question of whether training on scraped content is fair use remains unresolved and is seen as potentially pivotal for future model advances.
Agents, integration, and robotics
- Many expect the next big gains from:
  - Better orchestration: projects, tools, RAG, “panel of experts” agents, and workflow‑aware assistants.
  - Domain‑specific systems trained on internal code, docs, and standards.
- Robotics is discussed as a longer‑term frontier: LLMs + vision models are transforming research, but commercial, safe home robots are still seen as distant.
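The "panel of experts" orchestration idea above can be sketched as a simple router: each expert is a keyword set plus a handler, and queries fall back to a generalist when nothing matches. All names and the keyword-matching heuristic are hypothetical; production routers typically use a classifier or the LLM itself to dispatch.

```python
def route(question, experts):
    """Pick the expert whose keyword set best overlaps the question.
    experts maps name -> (keywords, handler) and must include a
    'generalist' entry used as the fallback."""
    words = set(question.lower().split())
    best_name, best_score = "generalist", 0
    for name, (keywords, _handler) in experts.items():
        score = len(words & keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def answer(question, experts):
    """Dispatch the question to the routed expert's handler."""
    _keywords, handler = experts[route(question, experts)]
    return handler(question)
```

The handlers here would be the domain-specific systems the thread describes, each grounded in its own internal code, docs, or standards.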