GPT-4.5
Model Positioning & Naming
- Seen as a very large, non‑reasoning “GPT‑4 successor” that would likely have been called GPT‑5 if it were more impressive.
- OpenAI frames it as their “last” big pre‑training model before unifying GPT (pre‑trained) and o‑series (reasoning) in a future GPT‑5.
- Some suspect it exists mainly to explore the limits of scaling pre‑training and to generate synthetic data, not as a mainstream product.
Capabilities & Benchmarks
- On coding benchmarks, GPT‑4.5 modestly beats GPT‑4o but clearly trails o3‑mini and Claude 3.7 on several public and third‑party tests.
- On SWE‑Lancer, it beats o3‑mini but still trails Claude 3.5/3.7 in some settings, adding to confusion about its niche.
- On hallucination benchmarks like SimpleQA, it improves over earlier OpenAI models but is roughly matched or beaten by some Anthropic models; benchmark contamination and metric design are repeatedly questioned.
- A few independent tests (Kagi, custom creative‑writing and QA benchmarks) show meaningful but not revolutionary gains over 4o.
Price, Access & Performance
- API pricing ($75/1M input and $150/1M output at launch, roughly 30× and 15× GPT‑4o) is widely called “insane,” “research‑only,” or “scarecrow pricing” meant to discourage heavy use.
- OpenAI itself says it may not keep 4.5 in the API long‑term; many see that as a warning against building products on it.
- Latency is much higher than 4o; some users report tens of seconds per response, making it unsuitable for interactive tools.
- Rollout is staggered: initially Pro/API only, with Plus users and broader tiers promised “after more GPUs arrive,” reinforcing a sense of GPU scarcity.
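The pricing multiplier above is easy to make concrete with a back‑of‑envelope calculation. The per‑million‑token figures below are launch‑time list prices as reported in the discussion ($75 in / $150 out for GPT‑4.5 vs $2.50 / $10 for GPT‑4o); treat them as a snapshot, since OpenAI can and does change pricing.

```python
# Back-of-envelope cost comparison between GPT-4.5 and GPT-4o.
# Prices are launch-time list prices in USD per 1M tokens (subject to change).
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical chat turn: 2,000 prompt tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")

# Ratio check against the "~30x input / 15x output" figure quoted above.
print(PRICES["gpt-4.5"]["input"] / PRICES["gpt-4o"]["input"])    # 30.0
print(PRICES["gpt-4.5"]["output"] / PRICES["gpt-4o"]["output"])  # 15.0
```

For this prompt‑heavy mix the per‑request cost works out to $0.225 vs $0.01, i.e. about 22.5× — which is why commenters describe the model as effectively priced out of interactive products.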
Use Cases: Coding, Writing & EQ
- Coding: multiple reports that o3‑mini and Claude 3.7 (often via Cursor/Claude‑Code) are significantly better for software work; 4.5 sometimes breaks simple tasks that cheaper models handle.
- Writing: many note clear improvement in tone, flow, and “naturalness” compared to 4o’s bullet‑point, corporate style; some find it the first OpenAI model that doesn’t “read like AI.”
- EQ / “vibes”: OpenAI showcases more empathetic, chatty answers; reactions split between “finally less robotic” and “therapy‑speak, manipulative, and infantilizing.”
- Several argue much of the EQ difference could be achieved by different system prompts on older models.
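The system‑prompt point is easy to illustrate: much of the warmer “EQ” tone can be injected into an older model via the system message in the standard chat‑completions format. A minimal sketch — the prompt wording here is a hypothetical example, not anything OpenAI ships:

```python
# Sketch: steering an older model toward a warmer, more conversational tone
# purely via the system prompt. The prompt text is illustrative, not OpenAI's.
EMPATHETIC_SYSTEM_PROMPT = (
    "You are a warm, attentive conversational partner. "
    "Acknowledge the user's feelings before offering advice, avoid bullet "
    "points and corporate phrasing, and write in a natural, flowing voice."
)

def build_messages(user_text: str) -> list[dict]:
    """Assemble a chat-completions message list with the tone-setting prompt."""
    return [
        {"role": "system", "content": EMPATHETIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

messages = build_messages("I failed my driving test again.")

# With the official client this would be sent as, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   client.chat.completions.create(model="gpt-4o", messages=messages)
```

Whether this fully closes the gap with GPT‑4.5’s native style is exactly what commenters dispute; the sketch only shows that tone is at least partly a prompt‑level knob.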
Safety, Hallucinations & Evaluations
- Debate over human preference evals: some see them as optimizing for obsequiousness over truth; others reply that complex correctness is inherently hard to measure without humans.
- Hallucination reductions are welcomed, but a 30–40% error rate on hard QA is still seen as far from trustworthy; some insist LLMs remain tools whose output must be verified.
Business, Strategy & Hype Cycle
- Many call the release underwhelming, either rushed out the door or, conversely, a model OpenAI sat on because it could not be made to look like a big leap.
- The high cost and modest gains reinforce a narrative that simple parameter scaling has hit diminishing returns; reasoning models (o1/o3, DeepSeek R1, Claude “thinking”) are seen as the real frontier.
- Broader skepticism grows around AGI timelines, OpenAI’s valuation, and the sustainability of GPU‑burning megamodels, with multiple commenters predicting an AI “plateau” or bubble deflation even as LLMs remain practically useful.