GPT-5.5
Model quality & benchmarks
- Many see GPT‑5.5 as an incremental but meaningful step over 5.4, especially for code, long-horizon tasks, and online research.
- Benchmarks vs non‑OpenAI models spark interest: strong on TerminalBench and CyberGym; slightly behind Anthropic’s Opus 4.7/Mythos on SWE‑Bench Pro and some reasoning exams.
- Some doubt the value of benchmarks altogether, citing overfitting, memorization concerns (especially on SWE‑Bench), and a lack of reproducibility.
Coding, agents & long-horizon work
- Several developers report large practical gains: better repo understanding, architecture, performance optimization, and multi-step coding tasks.
- Others complain about “motivation” problems in prior models (5.4 “stopping early” or being timid); 5.5 plus new Codex “heartbeats” are pitched as fixes for long-running workflows.
- Mixed experiences: some say Opus 4.7 is now worse than 4.6 and feels more like GPT, while 5.5 feels sharper and more decisive for code; others still prefer Claude for precision and autonomy.
Performance, tokens & pricing
- 5.5 is roughly 2× the API price of 5.4 and substantially more expensive than earlier GPT‑5.x releases and Chinese models.
- OpenAI staff argue that token efficiency improved markedly: fewer tokens per successful task, so "cost per task" may drop even as "cost per token" rises (see the sketch after this list).
- Users worry subscription limits will be hit faster, especially with “thinking” modes and aggressive default settings (e.g., faster mode in Codex).
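To make the "cost per task vs. cost per token" argument concrete, here is a minimal sketch of the arithmetic. All figures are hypothetical and illustrative only; they do not come from the thread or from OpenAI pricing. The point is simply that a 2× per-token price can still yield a cheaper completed task if token usage falls far enough.

```python
# Illustrative sketch with made-up numbers; not real OpenAI pricing.
# Shows how a higher per-token price can still mean a lower cost per task
# if the model needs fewer tokens to finish the same job.

def cost_per_task(price_per_million_tokens: float, tokens_per_task: float) -> float:
    """Dollar cost of one completed task given a price and token usage."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Hypothetical figures: the newer model costs twice as much per token
# but (in this assumption) uses 60% fewer tokens per successful task.
old = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=50_000)  # 5.4-like
new = cost_per_task(price_per_million_tokens=20.0, tokens_per_task=20_000)  # 5.5-like

print(f"old cost/task: ${old:.3f}")  # $0.500
print(f"new cost/task: ${new:.3f}")  # $0.400
```

Under these assumed numbers, the break-even point is a 50% reduction in tokens per task; any larger drop makes the pricier model cheaper per completed task.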
Safety, cyber and gating
- 5.5 ships with “stronger safeguards,” including stricter cyber classifiers and routed fallbacks to weaker models for risky activity.
- Some practitioners praise cyber capability at near‑Mythos benchmark scores in a broadly accessible model; others note gating via "trusted access" and ID verification for full cyber features.
- Security researchers report warnings or bans when using MCP tools for malware and reverse‑engineering work; appeals are sometimes denied.
UX, rollout & ecosystem
- Rollout is staggered (Pro/Enterprise first; Plus later), causing confusion and minor outages.
- Some dislike the product-forward strategy and fear future models may skip plain API access in favor of proprietary tools.
- Debates over prompt “cargo culting” and over-pep-talked agents continue; several argue modern models need simpler, more concise prompts.
Meta: dependence, open models & evaluations
- Multiple comments express unease at growing dependence on frontier coding agents and potential deskilling.
- Others point to fast-rising open-weight models as a future safety valve on costs and lock‑in.
- The “pelican on a bicycle” SVG test reappears as an informal, somewhat tongue‑in‑cheek visual benchmark; 5.5’s results are considered mediocre, fueling jokes and skepticism about real “intelligence.”