GPT-5 for Developers

Availability & Rollout

  • Several developers reported GPT‑5 briefly appearing and then disappearing in playgrounds and ChatGPT, suggesting a throttled, staggered rollout.
  • API access also came online gradually across orgs; some saw “model does not exist” errors before it propagated.

Benchmarks, Evals & Claims

  • Some commenters accused OpenAI of cherry‑picking τ2‑bench telecom scores over τ2‑bench airline, where GPT‑5 trails o3.
  • An OpenAI contributor explained that the telecom variant fixes brittle grading in airline/retail by scoring final outcomes rather than comparing against a single “reference” solution, arguing this makes it a better tool‑use eval.
  • Concern remains that current evals don’t capture context management or long‑running software tasks well.
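The reference-vs-outcome grading distinction described above can be sketched in a few lines. This is a hypothetical illustration of the two grading styles, not τ2‑bench's actual harness; all names are invented:

```python
# Sketch of two grading styles for agentic evals (hypothetical, not
# tau2-bench's real implementation).

def reference_grader(agent_actions, reference_actions):
    """Brittle: the agent must reproduce one 'reference' action sequence exactly."""
    return agent_actions == reference_actions

def outcome_grader(final_state, expected_state):
    """Robust: any action sequence that reaches the required end state passes."""
    return all(final_state.get(k) == v for k, v in expected_state.items())

# Two equally valid action orderings...
ref_actions = ["lookup_plan", "apply_discount", "confirm"]
agent_actions = ["apply_discount", "lookup_plan", "confirm"]  # different order

# ...that leave the simulated account in the same correct final state.
final_state = {"plan": "basic", "discount_applied": True, "confirmed": True}
expected = {"discount_applied": True, "confirmed": True}

print(reference_grader(agent_actions, ref_actions))  # False: sequence differs
print(outcome_grader(final_state, expected))         # True: outcome matches
```

Under reference grading the agent fails despite doing the right thing; under outcome grading it passes, which is the fix the contributor described.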

Pricing, Routing & Model Variants

  • Many noted GPT‑5 is dramatically cheaper than Claude Opus and o3, sometimes even cheaper than GPT‑4.1, and speculated that the price drop, not raw capability, is the release’s main achievement.
  • Confusion around routing: in ChatGPT, a router chooses between “fast” and “deep reasoning” models, but API users must pick an explicit model; there is no automatic routing in the API.
  • Some worry pricing could rise later once platform lock‑in grows.
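Concretely, “no automatic routing” means every API request names a model. A minimal sketch of what such a request payload looks like; the request shape follows OpenAI’s chat-completions format, and the specific model identifiers are assumptions for illustration:

```python
import json

# Sketch: API callers must select a concrete model per request;
# there is no "auto" router option on the API side.
# Model names below are illustrative assumptions.
def build_request(model: str, prompt: str) -> dict:
    return {
        "model": model,  # explicit choice, e.g. "gpt-5" vs. a smaller variant
        "messages": [{"role": "user", "content": prompt}],
    }

fast = build_request("gpt-5-mini", "Summarize this diff.")
deep = build_request("gpt-5", "Refactor this module and explain the plan.")

print(json.dumps(fast, indent=2))
```

Any routing logic (cheap model for easy prompts, larger model for hard ones) therefore has to live in the caller’s own code.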

Context Window & Long‑Running Tasks

  • Reported context is ~400k tokens (with differing input/output limits), larger than most competitors.
  • Multiple people stressed that large context ≠ effective use: context rot and degraded performance with “kitchen sink” prompts are still observed.
  • Real‑world workflows increasingly chunk work into many small tasks, clearing context often and using VCS/commits as external memory.
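The chunked workflow in the last bullet can be sketched as a loop where every sub-task starts with a fresh context and each result is committed, so the commit history, not the chat transcript, carries state between steps. This is a hypothetical sketch: the model call is stubbed out, and a plain list stands in for the git log:

```python
def run_model(messages):
    """Stub standing in for an LLM call (hypothetical)."""
    return f"patch for: {messages[-1]['content']}"

def run_chunked(tasks):
    commits = []  # stands in for git history, the "external memory"
    for task in tasks:
        # Fresh context for every sub-task: no accumulated transcript,
        # so long-context degradation ("context rot") never builds up.
        messages = [{"role": "user", "content": task}]
        result = run_model(messages)
        # In a real workflow this would be `git add` + `git commit`;
        # the next task reads the repo, not the previous conversation.
        commits.append({"message": task, "patch": result})
    return commits

history = run_chunked(["add logging", "fix flaky test", "update docs"])
for commit in history:
    print(commit["message"])
```

The key property is that no task ever sees another task’s transcript, only its committed output, which is why the pattern sidesteps “kitchen sink” prompts entirely.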

Coding & Agentic Performance

  • Experiences are mixed:
    • Several users say GPT‑5 (especially in Cursor) outperforms Opus/Sonnet and GPT‑4.1 on real coding tasks, long‑running issues, and tool use, sometimes solving problems prior models failed.
    • Others find Claude Code more reliable, especially for long‑lived projects, Elixir, or complex infra; some report GPT‑5 ignoring simple instructions and writing “junior‑esque” or odd code.
    • Latency can be very high in some IDE integrations, making GPT‑5 unusable for interactive assistance.

Tooling, Subscriptions & UX

  • Codex CLI now defaults to GPT‑5 and supports ChatGPT login (no per‑token billing), but its UX is widely described as inferior to Claude Code’s (permissions handling, terminal behavior, no image support).
  • Many developers want a Claude‑Max‑style flat subscription for strong agentic harnesses; pay‑per‑token is seen as mentally and financially taxing for heavy use.

Structured Output & Hallucinations

  • The new context‑free grammar / regex‑constrained tool calls are widely viewed as one of the most exciting features, enabling stricter JSON/SQL/safe outputs.
  • Some early RAG and tool‑calling tests report significantly fewer hallucinations and better willingness to say “I don’t know,” which many see as a major practical improvement.
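The appeal of grammar/regex constraints is that output validity is enforced during generation rather than checked afterward. As a client-side approximation of the idea (the real feature applies the grammar inside the model’s sampler; this sketch only validates after the fact), the same contract can be expressed as a regex:

```python
import re

# A regex contract for a tool's output: an ISO-style date, nothing else.
# With constrained tool calls, a grammar like this is enforced while the
# model generates; here we merely validate the result as a sketch.
DATE_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def validate_tool_output(text: str) -> bool:
    # fullmatch: the entire output must satisfy the pattern,
    # so trailing prose or markdown fencing is rejected too.
    return DATE_PATTERN.fullmatch(text) is not None

print(validate_tool_output("2025-08-07"))    # well-formed date: accepted
print(validate_tool_output("tomorrow-ish"))  # free text: rejected
```

Decode-time enforcement goes further than this post-hoc check: an invalid token can never be sampled, so the caller never sees a malformed payload at all.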

Expectations & AGI

  • A side discussion debates AGI timelines and whether LLMs are “just text predictors,” with views ranging from “LLMs are saturating benchmarks and that’s enough” to “this is clearly diminishing returns and not real intelligence.”