GPT-5.5
Model quality & benchmarks
- Many see GPT‑5.5 as an incremental but meaningful step over 5.4, especially for code, long-horizon tasks, and online research.
- Benchmarks vs non‑OpenAI models spark interest: strong on TerminalBench and CyberGym; slightly behind Anthropic’s Opus 4.7/Mythos on SWE‑Bench Pro and some reasoning exams.
- Some doubt the value of benchmarks altogether, citing overfitting, memorization concerns (especially on SWE‑Bench), and a lack of reproducibility.
Coding, agents & long-horizon work
- Several developers report large practical gains: better repo understanding, architecture, performance optimization, and multi-step coding tasks.
- Others complain about “motivation” problems in prior models (5.4 “stopping early” or being timid); 5.5 plus new Codex “heartbeats” are pitched as fixes for long-running workflows.
- Mixed experiences: some say Opus 4.7 is now worse than 4.6 and feels more like GPT, while 5.5 feels sharper and more decisive for code; others still prefer Claude for precision and autonomy.
Performance, tokens & pricing
- 5.5 is roughly 2× the API price of 5.4 and substantially more expensive than earlier GPT‑5.x releases and Chinese models.
- OpenAI staff argue that token efficiency improved markedly: fewer tokens per successful task, so "cost per task" may drop even as "cost per token" rises (see the sketch after this list).
- Users worry subscription limits will be hit faster, especially with “thinking” modes and aggressive default settings (e.g., faster mode in Codex).
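To make the "cost per task vs. cost per token" argument concrete, here is a minimal sketch of the arithmetic. All figures are hypothetical and illustrative only; they do not come from the thread or from OpenAI pricing. The point is simply that a 2× per-token price can still yield a cheaper completed task if token usage falls far enough.

```python
# Illustrative sketch with made-up numbers; not real OpenAI pricing.
# Shows how a higher per-token price can still mean a lower cost per task
# if the model needs fewer tokens to finish the same job.

def cost_per_task(price_per_million_tokens: float, tokens_per_task: float) -> float:
    """Dollar cost of one completed task given a price and token usage."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Hypothetical figures: the newer model costs twice as much per token
# but (in this assumption) uses 60% fewer tokens per successful task.
old = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=50_000)  # 5.4-like
new = cost_per_task(price_per_million_tokens=20.0, tokens_per_task=20_000)  # 5.5-like

print(f"old cost/task: ${old:.3f}")  # $0.500
print(f"new cost/task: ${new:.3f}")  # $0.400
```

Under these assumed numbers, the break-even point is a 50% reduction in tokens per task; any larger drop makes the pricier model cheaper per completed task.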
Safety, cyber and gating
- 5.5 ships with “stronger safeguards,” including stricter cyber classifiers and routed fallbacks to weaker models for risky activity.
- Some practitioners praise cyber capability at near‑Mythos benchmark scores in a broadly accessible model; others note gating via "trusted access" and ID verification for full cyber features.
- Security researchers report warnings or bans when using MCP tools for malware and reverse‑engineering work; appeals are sometimes denied.
UX, rollout & ecosystem
- Rollout is staggered (Pro/Enterprise first; Plus later), causing confusion and minor outages.
- Some dislike the product-forward strategy and fear future models may skip plain API access in favor of proprietary tools.
- Debates over prompt “cargo culting” and over-pep-talked agents continue; several argue modern models need simpler, more concise prompts.
Meta: dependence, open models & evaluations
- Multiple comments express unease at growing dependence on frontier coding agents and potential deskilling.
- Others point to fast-rising open-weight models as a future safety valve on costs and lock‑in.
- The “pelican on a bicycle” SVG test reappears as an informal, somewhat tongue‑in‑cheek visual benchmark; 5.5’s results are considered mediocre, fueling jokes and skepticism about real “intelligence.”