Hot take: GPT‑4.5 is a nothingburger
How People Evaluate LLMs
- Many argue there is no single objective metric for “better” models; benchmarks can be gamed and don’t track everyday usefulness.
- Suggested approaches:
  - Human pairwise comparison on specific tasks.
  - Domain experts rating answers, not random raters.
  - Personal “canaries”: a fixed set of prompts in domains you know deeply (coding, niche hobbies, technical explanations).
  - Asking models to play devil’s advocate or reason under strict constraints.
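The pairwise-comparison and “personal canary” ideas above can be sketched as a tiny blind-evaluation harness. Everything here is illustrative: the prompts, the model stubs, and the function names are placeholders, not any real API.

```python
import random

# Minimal sketch of a blinded pairwise "canary" harness. The model functions
# are hypothetical stand-ins for real API calls; swap in actual clients.

CANARIES = [
    "Explain Rust's borrow checker to a C programmer.",
    "Why does my sourdough starter smell like acetone?",
    "Derive the closed form for the sum of the first n squares.",
]

def model_a(prompt: str) -> str:
    return f"[A] answer to: {prompt}"   # placeholder response

def model_b(prompt: str) -> str:
    return f"[B] answer to: {prompt}"   # placeholder response

def blind_pair(prompt: str, rng: random.Random):
    """Return both answers in random order, keeping the hidden labels."""
    pair = [("a", model_a(prompt)), ("b", model_b(prompt))]
    rng.shuffle(pair)
    return pair

def run_eval(judge, seed: int = 0) -> dict:
    """judge(prompt, left, right) -> 0 or 1, picking the better answer blind.

    Returns win counts per hidden model label across all canary prompts.
    """
    rng = random.Random(seed)
    wins = {"a": 0, "b": 0}
    for prompt in CANARIES:
        pair = blind_pair(prompt, rng)
        pick = judge(prompt, pair[0][1], pair[1][1])
        wins[pair[pick][0]] += 1
    return wins
```

Shuffling hides which model produced which answer, so the judge (a human rater in practice) cannot favor a label; fixing the canary set across model releases is what makes the comparison repeatable.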
- Some emphasize that hallucinations and failures often come down to user skill, prompt quality, and calibrated expectations.
Reception of GPT‑4.5
- Broad sentiment: incremental improvement over GPT‑4/4o, with no headline new capability; “underwhelming” is common.
- Specific positives:
  - Slightly better at gluing together complex codebases and libraries.
  - Some users find it better at sustained philosophical or argumentative dialogue, less people‑pleasing.
  - Feels more human in cadence and nuance to some; a few say it crosses their “uncanny valley.”
- Specific negatives:
  - Worse than reasoning‑tuned or competing models (e.g., o3‑mini, DeepSeek‑R1) on coding, reasoning, and creativity, according to several users.
  - Odd failure modes (e.g., bizarre word repetition loops) suggest rough edges.
  - Many see it as only “slightly better” than cheaper competitors while being ~10–15x more expensive per token.
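The price gap behind that last complaint is easy to make concrete with a back-of-the-envelope calculation. The per-million-token prices below are illustrative placeholders, not actual list prices:

```python
def monthly_cost(price_per_mtok: float, tokens_per_day: float, days: int = 30) -> float:
    """Dollar cost for a given per-million-token price and daily token usage."""
    return price_per_mtok * tokens_per_day / 1_000_000 * days

# Illustrative placeholder prices in dollars per million tokens.
frontier, budget = 75.0, 5.0
usage = 2_000_000  # tokens processed per day

print(monthly_cost(frontier, usage))  # frontier-model monthly bill
print(monthly_cost(budget, usage))    # budget-model monthly bill
print(frontier / budget)              # cost multiple
```

At these assumed prices, a 15x price multiple turns a few hundred dollars a month into a few thousand, which is why “slightly better” is a hard sell at volume.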
Diminishing Returns, Scaling Laws, and AGI
- GPT‑4.5 is widely interpreted as evidence of diminishing returns from naive scaling of LLMs: more compute, marginal gains, rising cost.
- Some argue this contradicts optimistic scaling‑law narratives (more compute → a steady march to AGI); others say raw performance still tracks the scaling‑law predictions, but the economics (cost per unit of gain) are breaking down.
- Strong skepticism that current LLM architecture alone leads to AGI; analogies to S‑curves, Moore’s law flattening, and past hype bubbles (blockchain, “big data,” metaverse).
- Others counter that linear intelligence gains can still yield large economic impact, and that “AGI” is a moving goalpost—today’s systems already look like AGI relative to 2019 expectations.
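The “performance tracks predictions but the economics break” point follows directly from the power-law shape of empirical scaling laws. A toy illustration, assuming a Chinchilla-style law L(C) = a · C^(−α) with made-up constants:

```python
# Toy illustration of diminishing returns from naive scaling, assuming a
# Chinchilla-style power law L(C) = a * C**(-alpha). The constants are
# illustrative, not fitted to any real model.
a, alpha = 10.0, 0.05

def loss(compute: float) -> float:
    """Pretraining loss as a function of compute under the assumed power law."""
    return a * compute ** (-alpha)

# Each 10x jump in compute buys a smaller absolute loss reduction,
# so the marginal cost per unit of improvement keeps climbing.
for c in (1, 10, 100):
    gain = loss(c) - loss(10 * c)   # improvement from the next 10x of compute
    extra = 9 * c                   # additional compute spent to get it
    print(f"compute={c:>4}  gain={gain:.3f}  cost/gain={extra / gain:.1f}")
```

Loss keeps falling forever under the power law, so the scaling curve itself never “breaks”; what explodes is the compute bill per increment of improvement, which is the diminishing-returns reading of GPT‑4.5.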
Business Models, Hype, and Industry Dynamics
- Multiple comments question the lack of robust, profitable business models given spending on a “half‑trillion‑dollar” scale.
- Debate over whether foundation models are heading toward commoditization, with open or cheaper competitors (DeepSeek, Claude, Grok, etc.) eroding OpenAI’s edge.
- Some see regulatory and safety rhetoric as partly a play for hype and regulatory capture rather than pure science.
Use Cases, Limits, and Human Perception
- Users report real productivity wins (e.g., coding help, research assistance, editing coursework or journalism, philosophical exploration), but also emphasize:
  - Persistent unreliability, non‑factuality, and weird edge‑case behavior.
  - Very uneven performance across tasks; older or smaller models sometimes outperform frontier ones on narrow jobs.
- There’s a split between people who experience these systems as almost person‑like (even feeling bad deleting chats) and those who see them as glorified, stochastic text tools whose “lifelike” feel is just human projection.