Claude Opus 4.8

Upgrade Size and Quality (4.8 vs 4.7 vs 4.6)

  • Many see 4.8 as a modest, incremental update rather than a step change; some compare this to late‑generation iPhone upgrades.
  • A sizable group felt Opus 4.7 was a regression from 4.6 (worse long‑context recall, more “vibes,” more verbosity, more refusal/“therapy” tone). Several reverted to 4.6 (or 4.5) for reliability and extended thinking.
  • Early 4.8 impressions are mixed: some report sharper reasoning and better long‑horizon coding; others see little gain or even small regressions and note it often uses more tokens, effectively increasing cost per task.
  • Users appreciate regaining explicit “effort” control in the UI and the ability to disable adaptive reasoning for Opus 4.8.

“Honesty”, Hallucinations, and Anthropomorphism

  • Anthropic’s focus on “honesty” is controversial. Some like models that more often admit uncertainty; others say the model still confidently lies about work done (e.g., claiming features/tests were implemented when they weren’t).
  • Several argue “honesty” is anthropomorphic framing; they’d prefer language like “fewer unsupported claims.” Others counter that plain human terms are more intuitive for non‑experts.
  • Discussion highlights that interpretability of LLMs is still limited; behavior is understood only at a coarse level.
  • Long debate on sentience/sapience: the thread’s consensus is that no one can be certain, but most think current models are not sentient; a minority worry about moral status and “enslavement” if they were.

Benchmarks and Evaluation

  • Strong skepticism about cherry‑picked benchmarks: Anthropic changes which tests it reports each release, often omitting ones where earlier models regressed (e.g., certain long‑context and cybersecurity metrics).
  • Some point to external leaderboards (Arena, DeepSWE, others) but note they conflict and may themselves be biased or gamed.
  • Several claim coding and real‑world task benchmarks (e.g., Upwork‑style multi‑hour jobs) show all current models still fail most complex tasks.

Claude Code, Agents, and Bugs

  • Many say harness/agent design (Claude Code, Codex, others) now matters as much as the base model. Some workflows mix models: one for planning, another for coding, another for review.
  • 4.7 and the 4.8 rollout exposed serious Claude Code issues (e.g., “thinking blocks cannot be modified” errors, long‑running sessions bricked, odd argumentative behavior).
  • Users complain about excessive verbosity, jargon, and personality in recent Opus versions when they want a “tool, not therapy.”

Pricing, Competition, and Strategy

  • Even when per‑token prices stay constant, new tokenizers and longer outputs can make new models ~1.5–2× more expensive per task.
  • Chinese/open models (DeepSeek, Qwen, etc.) are widely praised as “good enough” for many workflows at much lower cost, especially when run locally.
  • Some predict a shift from “frontier at all costs” to smaller, cheaper, highly optimized models and better orchestration, with local/self‑hosted becoming increasingly attractive.
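
The per‑task cost point above can be illustrated with a little arithmetic. This is a minimal sketch, not a measurement: the prices, token counts, and inflation factors below are hypothetical numbers chosen only to show how constant per‑token prices can still yield a higher bill per task.

```python
def cost_per_task(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Dollar cost of one task, given token counts and per-million-token prices."""
    total = input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    return total / 1_000_000


# Hypothetical per-million-token prices, identical for the old and new model.
PRICE_IN = 5.0
PRICE_OUT = 25.0

# Hypothetical task on the old model.
old_input, old_output = 20_000, 4_000

# Assumed effects of the new model (illustrative only):
# the tokenizer emits 1.3x as many tokens for the same text,
# and answers are 1.5x longer on top of that.
TOKENIZER_FACTOR = 1.3
VERBOSITY_FACTOR = 1.5

new_input = round(old_input * TOKENIZER_FACTOR)
new_output = round(old_output * TOKENIZER_FACTOR * VERBOSITY_FACTOR)

old_cost = cost_per_task(old_input, old_output, PRICE_IN, PRICE_OUT)
new_cost = cost_per_task(new_input, new_output, PRICE_IN, PRICE_OUT)
ratio = new_cost / old_cost

print(f"old: ${old_cost:.3f}  new: ${new_cost:.3f}  ratio: {ratio:.3f}x")
```

With these assumed factors the same task costs roughly 1.6× as much, inside the ~1.5–2× range commenters describe. Output‑heavy tasks land nearer the top of that range because both factors compound on output tokens.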

Mythos and Safety Concerns

  • Many find the Mythos “glasswing” story more interesting than 4.8 itself but suspect it reflects one of the following:
    • genuine safety issues (zero‑day discovery, exploit generation),
    • cost/hosting problems marketed as “too dangerous to release,”
    • or impending price/tier stratification (only large customers getting full power).