Claude Opus 4.8
Upgrade Size and Quality (4.8 vs 4.7 vs 4.6)
- Many see 4.8 as a modest, incremental update rather than a step change; some compare this to late‑generation iPhone upgrades.
- A sizable group felt Opus 4.7 was a regression from 4.6 (worse long‑context recall, more “vibes,” more verbosity, more refusal/”therapy” tone). Several reverted to 4.6 (or 4.5) for reliability and extended thinking.
- Early 4.8 impressions are mixed: some report sharper reasoning and better long‑horizon coding; others see little gain or even small regressions and note it often uses more tokens, effectively increasing cost per task.
- Users appreciate regaining explicit “effort” control in the UI and the ability to disable adaptive reasoning for Opus 4.8.
“Honesty”, Hallucinations, and Anthropomorphism
- Anthropic’s focus on “honesty” is controversial. Some like models that more often admit uncertainty; others say it still confidently lies about work done (e.g., claiming features/tests implemented when they weren’t).
- Several argue “honesty” is anthropomorphic framing; they’d prefer language like “fewer unsupported claims.” Others counter that plain human terms are more intuitive for non‑experts.
- Discussion highlights that interpretability of LLMs is still limited; behavior is understood only at a coarse level.
- Long debate on sentience/sapience: consensus in thread is that no one can be certain, but most think current models are not sentient; a minority worry about moral status and “enslavement” if they were.
Benchmarks and Evaluation
- Strong skepticism about cherry‑picked benchmarks: Anthropic changes which tests it reports each release, often omitting ones where earlier models regressed (e.g., certain long‑context and cybersecurity metrics).
- Some point to external leaderboards (Arena, DeepSWE, others) but note they conflict and may themselves be biased or gamed.
- Several claim coding and real‑world task benchmarks (e.g., Upwork‑style multi‑hour jobs) show all current models still fail most complex tasks.
Claude Code, Agents, and Bugs
- Many say harness/agent design (Claude Code, Codex, others) now matters as much as the base model. Some workflows mix models: one for planning, another for coding, another for review.
- 4.7 and the 4.8 rollout exposed serious Claude Code issues (e.g., “thinking blocks cannot be modified” errors, long‑running sessions bricked, odd argumentative behavior).
- Users complain about excessive verbosity, jargon, and personality in recent Opus versions when they want a “tool, not therapy.”
Pricing, Competition, and Strategy
- Even when per‑token prices stay constant, new tokenizers and longer outputs can make new models ~1.5–2× more expensive per task.
- Chinese/open models (DeepSeek, Qwen, etc.) are widely praised as “good enough” for many workflows at much lower cost, especially when run locally.
- Some predict a shift from “frontier at all costs” to smaller, cheaper, highly optimized models and better orchestration, with local/self‑hosted becoming increasingly attractive.
Mythos and Safety Concerns
- Many find the Mythos “glasswing” story more interesting than 4.8 itself but suspect:
- either genuine safety issues (zero‑day discovery, exploit generation),
- or cost/hosting problems marketed as “too dangerous to release,”
- or impending price/tiers stratification (only large customers getting full power).