GPT-4.5

Model Positioning & Naming

  • Seen as a very large, non‑reasoning “GPT‑4 successor” that would likely have been called GPT‑5 if it were more impressive.
  • OpenAI frames it as their “last” big pre‑training model before unifying GPT (pre‑trained) and o‑series (reasoning) in a future GPT‑5.
  • Some suspect it exists mainly to explore the limits of scaling pre‑training and to generate synthetic data, not as a mainstream product.

Capabilities & Benchmarks

  • On coding benchmarks, GPT‑4.5 modestly beats GPT‑4o but is clearly behind o3‑mini and Claude 3.7 on several public and third‑party tests.
  • On SWE‑Lancer, it beats o3‑mini but still trails Claude 3.5/3.7 in some settings, adding to confusion about its niche.
  • On hallucination benchmarks like SimpleQA, it improves over earlier OpenAI models but is roughly matched or beaten by some Anthropic models; benchmark contamination and metric design are repeatedly questioned.
  • A few independent tests (Kagi, custom creative‑writing and QA benchmarks) show meaningful but not revolutionary gains over 4o.

Price, Access & Performance

  • API pricing (≈30× GPT‑4o’s input‑token price and ≈15× its output‑token price) is widely called “insane,” “research‑only,” or “scarecrow pricing” meant to discourage heavy use.
  • OpenAI itself says it may not keep 4.5 in the API long‑term; many see that as a warning against building products on it.
  • Latency is much higher than 4o; some users report tens of seconds per response, making it unsuitable for interactive tools.
  • Rollout is staggered: initially Pro/API only, with Plus users and broader tiers promised “after more GPUs arrive,” reinforcing a sense of GPU scarcity.
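The pricing multipliers above translate into a blended cost increase that depends on a request's input/output mix. A minimal sketch, assuming placeholder per‑million‑token base prices (the dollar figures below are illustrative, not official):

```python
# Rough cost comparison using the multipliers cited in the discussion
# (~30x input, ~15x output vs GPT-4o). The base prices are placeholders.

BASE_INPUT_PER_M = 2.50    # hypothetical GPT-4o input $ per 1M tokens
BASE_OUTPUT_PER_M = 10.00  # hypothetical GPT-4o output $ per 1M tokens

def request_cost(input_tokens, output_tokens, in_mult=1, out_mult=1):
    """Dollar cost of one request at the given price multipliers."""
    return (input_tokens / 1e6 * BASE_INPUT_PER_M * in_mult
            + output_tokens / 1e6 * BASE_OUTPUT_PER_M * out_mult)

# A 10k-in / 1k-out request: baseline vs the ~30x/15x multipliers.
baseline = request_cost(10_000, 1_000)
scaled = request_cost(10_000, 1_000, in_mult=30, out_mult=15)
print(f"baseline ${baseline:.3f}, scaled ${scaled:.3f}, "
      f"ratio {scaled / baseline:.1f}x")
```

With these placeholder base prices, the blended ratio for an input‑heavy request comes out around 26×, below the headline 30× because output tokens scale by the smaller factor.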

Use Cases: Coding, Writing & EQ

  • Coding: multiple reports that o3‑mini and Claude 3.7 (often via Cursor/Claude‑Code) are significantly better for software work; 4.5 sometimes breaks simple tasks that cheaper models handle.
  • Writing: many note clear improvement in tone, flow, and “naturalness” compared to 4o’s bullet‑point, corporate style; some find it the first OpenAI model that doesn’t “read like AI.”
  • EQ / “vibes”: OpenAI showcases more empathetic, chatty answers; reactions split between “finally less robotic” and “therapy‑speak, manipulative, and infantilizing.”
  • Several argue that much of the EQ difference could be achieved by giving older models a different system prompt.
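The system‑prompt argument above can be sketched concretely. This builds a Chat Completions request payload that steers an older model toward the warmer style; the prompt wording and the `gpt-4o` model name are illustrative assumptions, not from the discussion:

```python
# Sketch: approximating the "EQ" style on an older model via a system
# prompt. Prompt text and default model name are assumptions.

EMPATHETIC_STYLE = (
    "You are a warm, conversational assistant. Briefly acknowledge the "
    "user's situation, avoid bullet points, and answer in plain prose."
)

def build_request(user_message, model="gpt-4o"):
    """Payload for a Chat Completions call with the style prompt attached."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": EMPATHETIC_STYLE},
            {"role": "user", "content": user_message},
        ],
    }
```

The payload would then be sent through the usual client call; the point is only that the tone shift lives in the system message, not in the model choice.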

Safety, Hallucinations & Evaluations

  • Debate over human preference evals: some see them as optimizing for obsequiousness over truth; others reply that complex correctness is inherently hard to measure without humans.
  • Hallucination reductions are welcomed, but a 30–40% error rate on hard QA is still viewed as far from trustworthy; some insist LLMs remain tools whose output must be verified.

Business, Strategy & Hype Cycle

  • Many call the release underwhelming: either rushed out, or something OpenAI sat on because it could not be made to look like a big leap.
  • The high cost and modest gains reinforce a narrative that simple parameter scaling has hit diminishing returns; reasoning models (o1/o3, DeepSeek R1, Claude “thinking”) are seen as the real frontier.
  • Broader skepticism grows around AGI timelines, OpenAI’s valuation, and the sustainability of GPU‑burning megamodels, with multiple commenters predicting an AI “plateau” or bubble deflation even as LLMs remain practically useful.