Measuring Claude 4.7's tokenizer costs
Tokenization changes and cost impact
- Multiple users report 20–35% more tokens for similar tasks with Opus 4.7, especially on code and large-context work (a measurement sketch follows this list).
- Some subscription users hit weekly limits in a day; API users see this as a direct price hike, while cached input can soften the blow.
- Confusion persists over Anthropic’s “usage limits” messaging vs underlying per-token billing.
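A minimal sketch of how to measure the delta directly, using the usage fields the Anthropic Python SDK already returns; the model IDs below are placeholders for whatever 4.6/4.7 identifiers your account exposes, and the prompt is illustrative.

```python
# Compare billed token counts for the same prompt across two model versions.
# Model IDs are placeholders -- substitute the identifiers your account exposes.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = ["claude-opus-4-6", "claude-opus-4-7"]  # hypothetical IDs
PROMPT = "Refactor this function to remove the nested loops: ..."

for model in MODELS:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    )
    usage = response.usage  # billed input/output token counts
    print(f"{model}: input={usage.input_tokens} output={usage.output_tokens}")
```

Running a fixed prompt set through both versions and comparing the usage totals is the cheapest way to verify (or refute) the 20–35% figure for your own workload.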
Perceived quality: Opus 4.7 vs 4.6
- Some see 4.7 as a clear upgrade: higher one‑shot success rate, better instruction following, fewer irrelevant tangents, stronger coding and planning.
- Others report regressions: more hallucinations (including fake tools), refusal to modify benign code due to malware checks, getting stuck in “side quests,” and worse results on established internal tasks.
- A few controlled tests claim:
  - 4.7 ≈ “old” 4.6 cost, ~20% costlier than “new” (apparently throttled) 4.6.
  - In at least one domain benchmark, 4.7 was both cheaper and more accurate due to shorter reasoning chains.
- Several users feel Opus 4.6 was silently “nerfed” prior to 4.7’s release.
Effort levels, reasoning, and compaction
- Anthropic added five “effort” modes; many find this confusing and suspect it increases the chance that users overpay.
- Docs now recommend xhigh (not max) for coding/agentic use; max is described as prone to overthinking and heavy token use (a request sketch follows this list).
- Aggressive context compaction introduces multi‑minute pauses and more tool calls, which users experience as both latency and hidden cost.
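A hedged sketch of what selecting an effort level might look like at the API layer. The endpoint, headers, model, max_tokens, and messages fields are the standard Messages API; the effort field and its xhigh value are assumptions taken from the modes described above, not a confirmed parameter name.

```python
# Hypothetical request pinning an effort level on a Messages API call.
# The "effort" field and its values are ASSUMED from the docs discussed above;
# only the endpoint, headers, model, max_tokens, and messages are standard.
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-7",  # placeholder model ID
        "max_tokens": 2048,
        "effort": "xhigh",  # assumed field; docs reportedly favor xhigh over max for coding
        "messages": [{"role": "user", "content": "Plan the refactor before writing code."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["usage"])  # compare billed tokens across effort settings
```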
Pricing, incentives, and “enshittification” concerns
- Some argue higher costs are inevitable: compute is expensive, VC subsidies are ending, and enterprise demand is strong.
- Others see “shrinkflation”: more tokens, vaguely described improvements, nerfed old models, and frequent new “flagship” releases as a way to extract more revenue.
- There is debate over whether public‑company pressures will push Anthropic toward profit maximization at the expense of user alignment and transparency.
Benchmarks, measurement, and A/B testing
- Users note that common benchmarks have high variance, are easy to game, and often lack sufficient sample sizes to detect small improvements.
- Claims that Anthropic A/B‑tested 4.6 vs 4.7 in production and reduced 4.6’s “thinking” to free up capacity have fed degradation and conspiracy narratives.
- Many emphasize that the metric that really matters is “cost per successful task,” not cost per token or per session, but it is hard and expensive to measure (a worked example follows this list).
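What that metric looks like concretely, as a toy calculation; every number below is invented for illustration, and the accounting is deliberately simplistic (a failed run counts as pure waste, with no partial credit).

```python
# Cost per successful task: total spend divided by the number of runs that
# actually succeeded. All figures are invented for illustration.
runs = {
    # model: (cost_per_run_usd, success_rate, attempts)
    "model_a": (0.40, 0.90, 100),
    "model_b": (0.25, 0.55, 100),
}

for model, (cost_per_run, success_rate, attempts) in runs.items():
    total_cost = cost_per_run * attempts
    successes = success_rate * attempts
    print(f"{model}: ${total_cost / successes:.2f} per successful task")

# model_a: $0.44 per successful task
# model_b: $0.45 per successful task -- the "cheap" model loses once failed
# runs (and the reruns they force) are priced in.
```

The catch, as noted above, is the success labels: getting reliable per-task success judgments at scale is exactly the hard and expensive part.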
Open-source and local models
- Several participants are moving some work to open models (Qwen, Gemma, GLM) via local or third‑party hosting, citing cost control and predictable behavior (see the sketch after this list).
- Consensus: open models are improving fast and are “good enough” for many tasks, but still lag top proprietary models for complex, high‑stakes coding and agentic workflows.
- Hardware requirements for truly frontier‑like local performance remain high; some warn local‑model enthusiasts are overselling current capabilities.
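A sketch of the kind of migration being described: pointing the standard OpenAI client at a locally hosted open model behind an OpenAI-compatible endpoint (vLLM and llama.cpp both expose one). The model name, port, and prompt are examples, not recommendations.

```python
# Route rote work to a locally hosted open model behind an OpenAI-compatible
# server, e.g. started with: vllm serve Qwen/Qwen2.5-Coder-32B-Instruct
# Model name and port are examples; match whatever your server actually serves.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = local.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a pytest for the slugify() helper."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```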
Workflow and model selection
- Growing view that teams should “right‑size” models: small/cheap for rote implementation, larger models for planning, synthesis, and high‑risk tasks (a routing sketch follows this list).
- Others counter that misjudging task complexity causes wasted runs: weaker models make a mess that then must be redone with stronger ones.
- Several note that human time (review, debugging, oversight) still dominates costs; until models are much more reliable, small price deltas per token are less important than accuracy and stability.
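A minimal sketch of the right-sizing idea: route by an explicitly declared task tier rather than guessing per request. Tier names and model IDs are placeholders; correctly judging the tier is the hard part flagged above, so it is deliberately left to the caller.

```python
# Minimal "right-sizing" router: the caller declares a task tier and the
# router maps it to a model. Tier names and model IDs are placeholders.
TIER_TO_MODEL = {
    "rote": "small-cheap-model",    # boilerplate, mechanical edits
    "standard": "mid-tier-model",   # typical implementation work
    "planning": "frontier-model",   # synthesis, architecture, high-risk changes
}

def pick_model(tier: str) -> str:
    """Map a declared task tier to a model ID, defaulting upward when unsure."""
    # Defaulting to the strongest model sidesteps the wasted-run failure mode:
    # a weaker model making a mess that a stronger one must then redo.
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["planning"])

assert pick_model("rote") == "small-cheap-model"
assert pick_model("unknown") == "frontier-model"
```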