GPT-5: Key characteristics, pricing and system card
System cards, benchmarks, and transparency
- “System card” is seen by some as marketing jargon akin to a product sheet; others note that labs use it for safety/eval reporting, but with fewer training details than the early “model cards.”
- Commenters complain about missing fundamentals (e.g. parameter counts, full benchmark tables) and say that without them it’s hard to reason about scaling, limits, and what actually improved.
- Some criticize the writeup as largely restating OpenAI PR, with no independent benchmarks yet.
Safety, fairness, and METR autonomy evals
- OpenAI’s fairness section (e.g., relying heavily on the BBQ bias benchmark) is viewed as thin for a model used in hiring, education, and business.
- People note that industries mostly do not build their own evals; AI labs and open‑source devs dominate that space.
- The METR report (a ≈2h15m task-length horizon at 50% success; see the sketch after this list) is debated: some say it’s in the scary regime for “AI 2027” forecasts; others note it came in slightly below prediction markets’ median expectations.
- Several doubt that task-duration curves are a robust metric for autonomy or danger.
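For readers unfamiliar with the metric, here is a rough sketch of how a “50%-success time horizon” can be derived: fit a logistic curve of success probability against log task duration and solve for the duration where the curve crosses 0.5. The data and function below are illustrative assumptions, not METR’s actual pipeline or numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (task_minutes, success_rate) pairs -- NOT real eval data.
durations = np.array([2, 5, 15, 30, 60, 120, 240, 480], dtype=float)
success_rate = np.array([0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.30, 0.15])

def logistic(log_t, a, b):
    """Success probability as a function of log task duration."""
    return 1.0 / (1.0 + np.exp(a * (log_t - b)))

# Fit in log-duration space, then read off the 50% crossing point.
(a, b), _ = curve_fit(logistic, np.log(durations), success_rate, p0=(1.0, np.log(60)))
horizon_minutes = np.exp(b)  # the logistic hits 0.5 exactly at log_t == b
print(f"~{horizon_minutes:.0f} min task horizon at 50% success")
```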
Training data, knowledge cutoff, and copyright
- The September 2024 cutoff (earlier than some competitors) prompts speculation: is it due to processing/filtering time, copyright sensitivity, or concern about AI‑generated web data polluting training?
- There’s extended debate over OpenAI’s claim not to train on paid API data, with some trusting legal/enterprise pressure and others assuming they’ll secretly use it, given their stance on web‑scraped copyrighted content.
Pricing, competition, and product lineup
- GPT‑5 is described as “Opus‑class at a fraction of the cost”; aggressive pricing is read as a response to tough competition (especially in the API market) rather than a sign of a moat.
- Some suspect below‑cost pricing; others think distillation and architectural efficiency just made inference cheap.
- New lineup: three sizes (regular/mini/nano) each with four reasoning levels (minimal/low/medium/high). Some find this more structured; others see choice overload and worry about constant “tune the model vs tune the prompt” dilemmas.
- ChatGPT uses an internal router to choose models and reasoning levels; the API exposes the raw knobs, so devs must benchmark and decide for themselves (a minimal sketch of those knobs follows this list).
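A minimal sketch of what “raw knobs” means on the API side, assuming the `openai` Python SDK’s Chat Completions interface and the size/effort names above; exact parameter names and accepted values may differ from OpenAI’s final docs.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed knobs: pick the size (gpt-5 / gpt-5-mini / gpt-5-nano) and the
# reasoning effort (minimal / low / medium / high) explicitly -- unlike in
# ChatGPT, there is no router making this choice for you.
resp = client.chat.completions.create(
    model="gpt-5-mini",
    reasoning_effort="low",
    messages=[{"role": "user", "content": "Summarize this changelog in three bullets: ..."}],
)
print(resp.choices[0].message.content)
```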
Reasoning modes, sampling controls, and tools
- Reasoning effort is framed as “test‑time scaling”: more compute per query instead of larger weights. Users report big behavioral differences between low/medium/high (a sweep sketch follows this list).
- Removal of temperature/top‑p controls for reasoning models frustrates some, who rely on low‑variance settings. One commenter claims flexible samplers complicate safety/alignment.
- Others note that for many use cases, you can just default to “largest model + highest reasoning” when cost isn’t critical.
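One way to act on the “benchmark and decide yourself” point from the previous section: sweep the same prompt across reasoning levels and compare latency, token usage, and output. A rough sketch, assuming the same SDK and parameter names as the earlier example.

```python
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "A train leaves at 14:07 and arrives at 16:52. How long is the trip?"

for effort in ("minimal", "low", "medium", "high"):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-5",
        reasoning_effort=effort,  # assumed knob; see the earlier sketch
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    usage = resp.usage
    print(f"{effort:>8}: {elapsed:5.1f}s, "
          f"{usage.completion_tokens} completion tokens -> "
          f"{resp.choices[0].message.content!r}")
```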
Reliability, hallucinations, and sycophancy
- OpenAI claims reduced hallucinations and sycophancy; several users say GPT‑5 feels more direct and less flattering than prior models, and more willing to “just do the task.”
- However, many report frequent factual and logical errors in everyday use (code, proofreading, JSON, dashboards), including during OpenAI’s own demos.
- A long subthread argues over what counts as a “hallucination” vs. a “dumb mistake”; some reserve the term for fabricated external facts, others apply it to any confidently wrong output. The rough consensus: whatever the label, users must still double‑check important answers.
- Models often crumble or over‑accommodate when told “you’re wrong,” though there are hints that newer safety training rewards them for politely holding their ground in some cases.
Capabilities, AGI prospects, and scaling debates
- Some are underwhelmed: given years of GPT‑5 hype, this feels like a strong but incremental upgrade, not a “world‑shattering” leap.
- Others argue that, compared to GPT‑4 two years ago, the cumulative progress (reasoning models, tool use, multimodal) is enormous; incremental steps are preferable to “fast takeoff.”
- There is extensive debate over whether LLMs can ever reach AGI:
  - Skeptics emphasize static weights, lack of persistent self‑modification, limited context windows, and inability to truly learn from ongoing experience.
  - Defenders say external memory, tools, and continual fine‑tuning could compensate, and that architecture alone doesn’t rule out AGI.
- Several see “pure scaling maximalism” giving way to a focus on routing, specialized submodels, workflows, and tool ecosystems—interpreted either as healthy maturation or as signs of diminishing returns from just more data/compute.
Developer experience: coding, tools, and informal evals
- Coding reviews are mixed: some users say GPT‑5 instantly enabled more advanced analysis and pipelines; others find it worse than earlier models or still too unreliable without strong tests and agentic loops.
- Tool‑calling behavior seems more aggressive and sophisticated (e.g., fanning out multiple tools to gather context), with the cheap token pricing making that more acceptable (a hedged multi-tool sketch closes this section).
- There’s continued fascination with the “pelican on a bicycle in SVG” test: GPT‑5 still struggles, which many treat as a tangible, human‑legible gauge of progress and a reminder that evals can be gamed or overfit.
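To make the “fanning out multiple tools” observation concrete, here is a hedged sketch of exposing several function tools in one Chat Completions request so the model may call more than one of them before answering. The tool names and schemas are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tools -- the model decides which (and how many) to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_codebase",
            "description": "Grep the repository for a symbol or string.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Return the contents of a file by path.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-5",
    tools=tools,
    messages=[{"role": "user", "content": "Why does build.py fail on Windows?"}],
)
# A more aggressive caller may return several tool_calls in a single turn.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```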