GPT-5

Coding performance and model comparisons

  • Many developers say Anthropic’s Claude (Sonnet 3.7/4 and Claude Code) still feels best for day‑to‑day work: refactors, non‑trivial feature builds, understanding existing code/data models, test planning, and tool use in IDEs.
  • Others argue Gemini and o3 produce higher‑quality code when you can feed the full context non‑agentically (see the packing sketch after this list), whereas Claude excels at speed and agentic workflows but can quietly introduce poor design and regressions.
  • Early GPT‑5 coding examples and the official repo are viewed as “months behind” what Claude Code already demonstrated; demos focus on greenfield JS apps, which some consider a very easy case.
  • Models still perform poorly on niche languages and atypical stacks (OCaml, C# Avalonia, Mathematica, SageMath, custom concurrency patterns), limiting usefulness on legacy or non‑web systems.
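
  For context, "feeding the full context non‑agentically" roughly means packing the relevant source files into a single prompt and making one model call, rather than letting an agent explore the repo with tools. A minimal Python sketch of that packing step follows; the file extensions, size cap, and prompt wording are illustrative assumptions, not anything described in the thread.

      # Pack a repository's source into one prompt for a single, non-agentic model call.
      # The extension filter and character budget below are arbitrary placeholders.
      from pathlib import Path

      SOURCE_EXTS = {".py", ".ts", ".ml", ".cs"}   # adjust per project
      MAX_CHARS = 1_200_000                        # rough stand-in for a large context window

      def pack_repo(root: str) -> str:
          parts, total = [], 0
          for path in sorted(Path(root).rglob("*")):
              if not path.is_file() or path.suffix not in SOURCE_EXTS:
                  continue
              text = path.read_text(encoding="utf-8", errors="replace")
              chunk = f"\n--- {path} ---\n{text}"
              if total + len(chunk) > MAX_CHARS:
                  break                            # stop before exceeding the budget
              parts.append(chunk)
              total += len(chunk)
          return "Here is the codebase:\n" + "".join(parts) + "\n\nTask: <describe the change>"

      prompt = pack_repo(".")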

Perceived advances and emerging limits

  • Benchmarks (SWE‑bench Verified ~75%, Aider Polyglot ~88%) show small gains over o3 and GPT‑4.x; several commenters say the jump feels more like “GPT‑4.2” than a true new generation.
  • Many see this as evidence we’re on the flattening part of the S‑curve for LLMs: big leaps from GPT‑3→4, then diminishing, expensive improvements. Others think breakthroughs in reasoning or new architectures could still appear.
  • The concrete wins most often cited: lower cost than prior reasoning models, a larger context window (up to 400k tokens), an integrated “thinking” mode, better routing between fast and slow reasoning, and reduced hallucination rates on OpenAI’s internal evals.
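
  As a rough illustration of what the integrated “thinking” mode and fast/slow routing look like from the API side, here is a minimal Python sketch. It assumes the official openai SDK, a "gpt-5" model identifier, and a reasoning‑effort knob along the lines described at launch; the exact parameter names are assumptions and may differ.

      # Minimal sketch: request a fast, low-"thinking" answer for a simple query.
      # The reasoning_effort parameter name is an assumption, not confirmed here.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      response = client.chat.completions.create(
          model="gpt-5",                # assumed model identifier
          reasoning_effort="minimal",   # assumed knob: skip slow reasoning for easy prompts
          messages=[{"role": "user", "content": "Explain why this nested loop is O(n^2)."}],
      )
      print(response.choices[0].message.content)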

Hype, AGI rhetoric, and trust

  • There’s broad irritation at the continued “AGI soon” and “PhD‑level” language when the launch demo itself repeats well‑known misconceptions (e.g., the incorrect Bernoulli/airfoil explanation of lift) and the model still hallucinates and reasons over‑confidently.
  • Some see GPT‑5 as further proof LLMs alone won’t reach AGI; others argue current progress is still impressive but far from the existential claims made over the last two years.
  • This fuels both job anxiety (especially among web/frontend devs) and a countervailing hope that AI will under‑deliver so that mass displacement never arrives.

Launch presentation, product decisions, and access

  • The livestream is widely criticized as dry, over‑scripted, and marred by obvious chart errors (mis‑scaled bars on the SWE‑bench and deception‑rate charts), reinforcing perceptions of “vibe‑driven” marketing.
  • OpenAI compared GPT‑5 only against its own models; commenters note the absence of head‑to‑head numbers against Claude, Gemini, or Qwen.
  • Deprecating the previous GPT‑4.x/o‑series models inside ChatGPT and pushing a unified GPT‑5 system is seen by some as welcome simplification and by others as lock‑in and tighter vendor control.
  • Mandatory ID + selfie verification for GPT‑5 API access is a major flashpoint, especially for users in sensitive domains (e.g., biology) already frustrated by aggressive safety filters on legitimate expert work.

Evaluations and desired real‑world tests

  • Several participants say existing benchmarks (IMO, pelican‑on‑a‑bike, toy apps) are now weak or easily overfit; they want evals on long‑horizon, multi‑step engineering tasks and large‑codebase refactors without losing the plot.
  • Early third‑party tests are mixed: some report strong long‑context coding and tool use; others see only modest, hard‑to‑feel improvements over top competitors.