GPT-4.5
Model Positioning & Naming
- Seen as a very large, non‑reasoning “GPT‑4 successor” that would likely have been called GPT‑5 if it were more impressive.
- OpenAI frames it as their “last” big pre‑training model before unifying GPT (pre‑trained) and o‑series (reasoning) in a future GPT‑5.
- Some suspect it exists mainly to explore the limits of scaling pre‑training and to generate synthetic data, not as a mainstream product.
Capabilities & Benchmarks
- On coding benchmarks, GPT‑4.5 modestly beats GPT‑4o but clearly trails o3‑mini and Claude 3.7 on several public and third‑party tests.
- On SWE‑Lancer, it beats o3‑mini but still trails Claude 3.5/3.7 in some settings, adding to confusion about its niche.
- On hallucination benchmarks like SimpleQA, it improves over earlier OpenAI models but is roughly matched or beaten by some Anthropic models; benchmark contamination and metric design are repeatedly questioned.
- A few independent tests (Kagi, custom creative‑writing and QA benchmarks) show meaningful but not revolutionary gains over 4o.
Price, Access & Performance
- API pricing ($75/1M input and $150/1M output at launch, roughly 30× and 15× GPT‑4o) is widely called “insane,” “research‑only,” or “scarecrow pricing” meant to discourage heavy use.
- OpenAI itself says it may not keep 4.5 in the API long‑term; many see that as a warning against building products on it.
- Latency is much higher than 4o; some users report tens of seconds per response, making it unsuitable for interactive tools.
- Rollout is staggered: initially Pro/API only, with Plus users and broader tiers promised “after more GPUs arrive,” reinforcing a sense of GPU scarcity.
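The pricing multiplier above is easy to make concrete with a back‑of‑envelope calculation. The per‑million‑token figures below are launch‑time list prices as reported in the discussion ($75 in / $150 out for GPT‑4.5 vs $2.50 / $10 for GPT‑4o); treat them as a snapshot, since OpenAI can and does change pricing.

```python
# Back-of-envelope cost comparison between GPT-4.5 and GPT-4o.
# Prices are launch-time list prices in USD per 1M tokens (subject to change).
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical chat turn: 2,000 prompt tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")

# Ratio check against the "~30x input / 15x output" figure quoted above.
print(PRICES["gpt-4.5"]["input"] / PRICES["gpt-4o"]["input"])    # 30.0
print(PRICES["gpt-4.5"]["output"] / PRICES["gpt-4o"]["output"])  # 15.0
```

For this prompt‑heavy mix the per‑request cost works out to $0.225 vs $0.01, i.e. about 22.5× — which is why commenters describe the model as effectively priced out of interactive products.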
Use Cases: Coding, Writing & EQ
- Coding: multiple reports that o3‑mini and Claude 3.7 (often via Cursor/Claude‑Code) are significantly better for software work; 4.5 sometimes breaks simple tasks that cheaper models handle.
- Writing: many note clear improvement in tone, flow, and “naturalness” compared to 4o’s bullet‑point, corporate style; some find it the first OpenAI model that doesn’t “read like AI.”
- EQ / “vibes”: OpenAI showcases more empathetic, chatty answers; reactions split between “finally less robotic” and “therapy‑speak, manipulative, and infantilizing.”
- Several argue much of the EQ difference could be achieved by different system prompts on older models.
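The system‑prompt point is easy to illustrate: much of the warmer “EQ” tone can be injected into an older model via the system message in the standard chat‑completions format. A minimal sketch — the prompt wording here is a hypothetical example, not anything OpenAI ships:

```python
# Sketch: steering an older model toward a warmer, more conversational tone
# purely via the system prompt. The prompt text is illustrative, not OpenAI's.
EMPATHETIC_SYSTEM_PROMPT = (
    "You are a warm, attentive conversational partner. "
    "Acknowledge the user's feelings before offering advice, avoid bullet "
    "points and corporate phrasing, and write in a natural, flowing voice."
)

def build_messages(user_text: str) -> list[dict]:
    """Assemble a chat-completions message list with the tone-setting prompt."""
    return [
        {"role": "system", "content": EMPATHETIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

messages = build_messages("I failed my driving test again.")

# With the official client this would be sent as, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   client.chat.completions.create(model="gpt-4o", messages=messages)
```

Whether this fully closes the gap with GPT‑4.5’s native style is exactly what commenters dispute; the sketch only shows that tone is at least partly a prompt‑level knob.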
Safety, Hallucinations & Evaluations
- Debate over human preference evals: some see them as optimizing for obsequiousness over truth; others reply that complex correctness is inherently hard to measure without humans.
- Hallucination reductions are welcomed, but a 30–40% error rate on hard QA is still seen as far from trustworthy; some insist LLMs remain tools whose output must be verified.
Business, Strategy & Hype Cycle
- Many call the release underwhelming, either rushed out the door or, conversely, a model OpenAI sat on because it could not be made to look like a big leap.
- The high cost and modest gains reinforce a narrative that simple parameter scaling has hit diminishing returns; reasoning models (o1/o3, DeepSeek R1, Claude “thinking”) are seen as the real frontier.
- Broader skepticism grows around AGI timelines, OpenAI’s valuation, and the sustainability of GPU‑burning megamodels, with multiple commenters predicting an AI “plateau” or bubble deflation even as LLMs remain practically useful.