2026-05-28

Claude Opus 4.8

Upgrade Size and Quality (4.8 vs 4.7 vs 4.6)

Many see 4.8 as a modest, incremental update rather than a step change; some compare this to late‑generation iPhone upgrades.
A sizable group felt Opus 4.7 was a regression from 4.6 (worse long‑context recall, more “vibes,” more verbosity, more refusal/“therapy” tone). Several reverted to 4.6 (or 4.5) for reliability and extended thinking.
Early 4.8 impressions are mixed: some report sharper reasoning and better long‑horizon coding; others see little gain or even small regressions and note it often uses more tokens, effectively increasing cost per task.
Users appreciate regaining explicit “effort” control in the UI and the ability to disable adaptive reasoning for Opus 4.8.

“Honesty”, Hallucinations, and Anthropomorphism

Anthropic’s focus on “honesty” is controversial. Some like models that more often admit uncertainty; others say the model still confidently lies about work done (e.g., claiming features/tests were implemented when they weren’t).
Several argue “honesty” is anthropomorphic framing; they’d prefer language like “fewer unsupported claims.” Others counter that plain human terms are more intuitive for non‑experts.
Discussion highlights that interpretability of LLMs is still limited; behavior is understood only at a coarse level.
Long debate on sentience/sapience: the consensus in the thread is that no one can be certain, but most think current models are not sentient; a minority worry about moral status and “enslavement” if they were.

Benchmarks and Evaluation

Strong skepticism about cherry‑picked benchmarks: Anthropic changes which tests it reports each release, often omitting ones where earlier models regressed (e.g., certain long‑context and cybersecurity metrics).
Some point to external leaderboards (Arena, DeepSWE, others) but note they conflict and may themselves be biased or gamed.
Several claim coding and real‑world task benchmarks (e.g., Upwork‑style multi‑hour jobs) show all current models still fail most complex tasks.

Claude Code, Agents, and Bugs

Many say harness/agent design (Claude Code, Codex, others) now matters as much as the base model. Some workflows mix models: one for planning, another for coding, another for review.
4.7 and the 4.8 rollout exposed serious Claude Code issues (e.g., “thinking blocks cannot be modified” errors, long‑running sessions bricked, odd argumentative behavior).
Users complain about excessive verbosity, jargon, and personality in recent Opus versions when they want a “tool, not therapy.”

Pricing, Competition, and Strategy

Even when per‑token prices stay constant, new tokenizers and longer outputs can make new models ~1.5–2× more expensive per task.
Chinese/open models (DeepSeek, Qwen, etc.) are widely praised as “good enough” for many workflows at much lower cost, especially when run locally.
Some predict a shift from “frontier at all costs” to smaller, cheaper, highly optimized models and better orchestration, with local/self‑hosted becoming increasingly attractive.
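
The per‑task cost point above is simple arithmetic, and a rough sketch makes it concrete. Everything in the block below is an illustrative assumption, not a figure from the thread or from any vendor: the per‑million‑token prices, the token counts, the 1.3× tokenizer inflation, and the 1.5× growth in output length are hypothetical inputs chosen only to show how a ratio in the ~1.5–2× range can arise while list prices stay flat.

```python
def cost_per_task(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Dollar cost of one task, given token counts and prices per million tokens."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000


def cost_ratio(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok,
               tokenizer_inflation, output_growth):
    """Ratio of new-model cost to old-model cost at identical per-token prices.

    tokenizer_inflation: hypothetical factor by which the new tokenizer
        increases the token count for the same text (applies to input and output).
    output_growth: hypothetical factor by which the new model writes longer answers.
    """
    old = cost_per_task(input_tokens, output_tokens,
                        in_price_per_mtok, out_price_per_mtok)
    new = cost_per_task(input_tokens * tokenizer_inflation,
                        output_tokens * tokenizer_inflation * output_growth,
                        in_price_per_mtok, out_price_per_mtok)
    return new / old


# Hypothetical task: 20k input tokens, 4k output tokens,
# $5 / $25 per million input / output tokens (made-up prices).
ratio = cost_ratio(20_000, 4_000, 5.0, 25.0,
                   tokenizer_inflation=1.3, output_growth=1.5)
print(f"new model costs about {ratio:.2f}x as much per task")
```

Because output tokens are usually priced higher than input tokens, the output‑length factor tends to dominate: in this sketch the input side grows only by the tokenizer factor, while the output side grows by both factors multiplied together.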

Mythos and Safety Concerns

Many find the Mythos “glasswing” story more interesting than 4.8 itself but suspect:
- either genuine safety issues (zero‑day discovery, exploit generation),
- or cost/hosting problems marketed as “too dangerous to release,”
- or impending price/tier stratification (only large customers getting full power).
