Measuring Claude 4.7's tokenizer costs
Tokenization changes and cost impact
- Multiple users report 20–35% more tokens for similar tasks with Opus 4.7, especially on code and large-context work (a measurement sketch follows this list).
- Some subscription users hit weekly limits in a day; API users see this as a direct price hike, while cached input can soften the blow.
- Confusion persists over Anthropic’s “usage limits” messaging vs underlying per-token billing.
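A minimal sketch of how to measure the delta directly, using the usage fields the Anthropic Python SDK already returns; the model IDs below are placeholders for whatever 4.6/4.7 identifiers your account exposes, and the prompt is illustrative.

```python
# Compare billed token counts for the same prompt across two model versions.
# Model IDs are placeholders -- substitute the identifiers your account exposes.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = ["claude-opus-4-6", "claude-opus-4-7"]  # hypothetical IDs
PROMPT = "Refactor this function to remove the nested loops: ..."

for model in MODELS:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    )
    usage = response.usage  # billed input/output token counts
    print(f"{model}: input={usage.input_tokens} output={usage.output_tokens}")
```

Running a fixed prompt set through both versions and comparing the usage totals is the cheapest way to verify (or refute) the 20–35% figure for your own workload.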
Perceived quality: Opus 4.7 vs 4.6
- Some see 4.7 as a clear upgrade: higher one‑shot success rate, better instruction following, fewer irrelevant tangents, stronger coding and planning.
- Others report regressions: more hallucinations (including fake tools), refusal to modify benign code due to malware checks, getting stuck in “side quests,” and worse results on established internal tasks.
- A few controlled tests claim:
  - 4.7 ≈ “old” 4.6 cost, ~20% costlier than “new” (apparently throttled) 4.6.
  - In at least one domain benchmark, 4.7 was both cheaper and more accurate due to shorter reasoning chains.
- Several users feel Opus 4.6 was silently “nerfed” prior to 4.7’s release.
Effort levels, reasoning, and compaction
- Anthropic added five “effort” modes; many find this confusing and suspect it increases the chance that users overpay.
- Docs now recommend xhigh (not max) for coding/agentic use; max is described as prone to overthinking and heavy token use (a request sketch follows this list).
- Aggressive context compaction introduces multi‑minute pauses and more tool calls, which users experience as both latency and hidden cost.
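A hedged sketch of what selecting an effort level might look like at the API layer. The endpoint, headers, model, max_tokens, and messages fields are the standard Messages API; the effort field and its xhigh value are assumptions taken from the modes described above, not a confirmed parameter name.

```python
# Hypothetical request pinning an effort level on a Messages API call.
# The "effort" field and its values are ASSUMED from the docs discussed above;
# only the endpoint, headers, model, max_tokens, and messages are standard.
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-7",  # placeholder model ID
        "max_tokens": 2048,
        "effort": "xhigh",  # assumed field; docs reportedly favor xhigh over max for coding
        "messages": [{"role": "user", "content": "Plan the refactor before writing code."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["usage"])  # compare billed tokens across effort settings
```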
Pricing, incentives, and “enshittification” concerns
- Some argue higher costs are inevitable: compute is expensive, VC subsidies are ending, and enterprise demand is strong.
- Others see “shrinkflation”: more tokens, vaguely described improvements, nerfed old models, and frequent new “flagship” releases as a way to extract more revenue.
- There is debate over whether public‑company pressures will push Anthropic toward profit maximization at the expense of user alignment and transparency.
Benchmarks, measurement, and A/B testing
- Users note that common benchmarks have high variance, are easy to game, and often lack sufficient sample sizes to detect small improvements.
- Claims that Anthropic A/B‑tested 4.6 vs 4.7 in production and reduced 4.6’s “thinking” to free up capacity have fed degradation and conspiracy narratives.
- Many emphasize that the metric that really matters is “cost per successful task,” not cost per token or per session, but it is hard and expensive to measure (a worked example follows this list).
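What that metric looks like concretely, as a toy calculation; every number below is invented for illustration, and the accounting is deliberately simplistic (a failed run counts as pure waste, with no partial credit).

```python
# Cost per successful task: total spend divided by the number of runs that
# actually succeeded. All figures are invented for illustration.
runs = {
    # model: (cost_per_run_usd, success_rate, attempts)
    "model_a": (0.40, 0.90, 100),
    "model_b": (0.25, 0.55, 100),
}

for model, (cost_per_run, success_rate, attempts) in runs.items():
    total_cost = cost_per_run * attempts
    successes = success_rate * attempts
    print(f"{model}: ${total_cost / successes:.2f} per successful task")

# model_a: $0.44 per successful task
# model_b: $0.45 per successful task -- the "cheap" model loses once failed
# runs (and the reruns they force) are priced in.
```

The catch, as noted above, is the success labels: getting reliable per-task success judgments at scale is exactly the hard and expensive part.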
Open-source and local models
- Several participants are moving some work to open models (Qwen, Gemma, GLM) via local or third‑party hosting, citing cost control and predictable behavior (see the sketch after this list).
- Consensus: open models are improving fast and are “good enough” for many tasks, but still lag top proprietary models for complex, high‑stakes coding and agentic workflows.
- Hardware requirements for truly frontier‑like local performance remain high; some warn local‑model enthusiasts are overselling current capabilities.
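A sketch of the kind of migration being described: pointing the standard OpenAI client at a locally hosted open model behind an OpenAI-compatible endpoint (vLLM and llama.cpp both expose one). The model name, port, and prompt are examples, not recommendations.

```python
# Route rote work to a locally hosted open model behind an OpenAI-compatible
# server, e.g. started with: vllm serve Qwen/Qwen2.5-Coder-32B-Instruct
# Model name and port are examples; match whatever your server actually serves.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = local.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a pytest for the slugify() helper."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```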
Workflow and model selection
- Growing view that teams should “right‑size” models: small/cheap for rote implementation, larger models for planning, synthesis, and high‑risk tasks (a routing sketch follows this list).
- Others counter that misjudging task complexity causes wasted runs: weaker models make a mess that then must be redone with stronger ones.
- Several note that human time (review, debugging, oversight) still dominates costs; until models are much more reliable, small price deltas per token are less important than accuracy and stability.
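A minimal sketch of the right-sizing idea: route by an explicitly declared task tier rather than guessing per request. Tier names and model IDs are placeholders; correctly judging the tier is the hard part flagged above, so it is deliberately left to the caller.

```python
# Minimal "right-sizing" router: the caller declares a task tier and the
# router maps it to a model. Tier names and model IDs are placeholders.
TIER_TO_MODEL = {
    "rote": "small-cheap-model",    # boilerplate, mechanical edits
    "standard": "mid-tier-model",   # typical implementation work
    "planning": "frontier-model",   # synthesis, architecture, high-risk changes
}

def pick_model(tier: str) -> str:
    """Map a declared task tier to a model ID, defaulting upward when unsure."""
    # Defaulting to the strongest model sidesteps the wasted-run failure mode:
    # a weaker model making a mess that a stronger one must then redo.
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["planning"])

assert pick_model("rote") == "small-cheap-model"
assert pick_model("unknown") == "frontier-model"
```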