Measuring Claude 4.7's tokenizer costs

Tokenization changes and cost impact

  • Multiple users report 20–35% more tokens for similar tasks with Opus 4.7, especially on code and large-context work.
  • Some subscription users hit weekly limits within a day; API users experience the extra tokens as a direct price hike, though cached-input pricing can soften the blow.
  • Confusion persists over Anthropic’s “usage limits” messaging vs underlying per-token billing.
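The effective price change described above is easy to estimate with back-of-the-envelope arithmetic. A minimal sketch, where all prices, token counts, and the cache-read discount are illustrative assumptions, not Anthropic's published numbers:

```python
# Rough model of per-task cost when a new model uses more tokens per task.
# Every number here is an illustrative assumption, not a published price.

def effective_cost(tokens_in, tokens_out, price_in, price_out,
                   cached_fraction=0.0, cache_read_discount=0.1):
    """Dollar cost of one task.

    cached_fraction: share of input tokens served from the prompt cache.
    cache_read_discount: cached-input price as a fraction of the base
    input price (assumed here; check current pricing).
    """
    fresh_in = tokens_in * (1 - cached_fraction)
    cached_in = tokens_in * cached_fraction
    return (fresh_in * price_in
            + cached_in * price_in * cache_read_discount
            + tokens_out * price_out)

# Hypothetical task: 50k input / 5k output tokens on the old model,
# 30% more tokens on the new one at the same per-token price.
price_in, price_out = 15 / 1e6, 75 / 1e6   # $/token (illustrative)
old = effective_cost(50_000, 5_000, price_in, price_out)
new = effective_cost(65_000, 6_500, price_in, price_out)
new_cached = effective_cost(65_000, 6_500, price_in, price_out,
                            cached_fraction=0.8)

print(f"old: ${old:.2f}  new: ${new:.2f}  new w/ caching: ${new_cached:.2f}")
```

With both input and output scaled by 1.3 at unchanged per-token prices, per-task cost rises exactly 30%, which is why API users read token inflation as a straight price hike; a high cache-hit rate can pull the effective cost back below the old baseline.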

Perceived quality: Opus 4.7 vs 4.6

  • Some see 4.7 as a clear upgrade: higher one‑shot success rate, better instruction following, fewer irrelevant tangents, stronger coding and planning.
  • Others report regressions: more hallucinations (including fake tools), refusal to modify benign code due to malware checks, getting stuck in “side quests,” and worse results on established internal tasks.
  • A few controlled tests claim:
    • 4.7 costs roughly the same as the “old” 4.6, and ~20% more than the “new” (apparently throttled) 4.6.
    • In at least one domain benchmark, 4.7 was both cheaper and more accurate due to shorter reasoning chains.
  • Several users feel Opus 4.6 was silently “nerfed” prior to 4.7’s release.

Effort levels, reasoning, and compaction

  • Anthropic added five “effort” modes; many find this confusing and suspect it increases the chance that users overpay.
  • Docs now recommend xhigh (not max) for coding/agentic use; max is described as prone to overthinking and heavy token use.
  • Aggressive context compaction introduces multi‑minute pauses and more tool calls, which users experience as both latency and hidden cost.

Pricing, incentives, and “enshittification” concerns

  • Some argue higher costs are inevitable: compute is expensive, VC subsidies are ending, and enterprise demand is strong.
  • Others see “shrinkflation”: more tokens, vaguely described improvements, nerfed old models, and frequent new “flagship” releases as a way to extract more revenue.
  • There is debate over whether public‑company pressures will push Anthropic toward profit maximization at the expense of user alignment and transparency.

Benchmarks, measurement, and A/B testing

  • Users note that common benchmarks have high variance, are easy to game, and often lack sufficient sample sizes to detect small improvements.
  • Claims that Anthropic A/B‑tested 4.6 vs 4.7 in production and reduced 4.6’s “thinking” to free up capacity fuel degradation and conspiracy narratives.
  • Many emphasize that what really matters is “cost per successful task,” not cost per token or per session, but this is hard and expensive to measure.
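Both points above can be made concrete. A minimal sketch with invented success rates and per-run costs, plus a rough normal-approximation estimate of how many benchmark trials it takes to distinguish two success rates:

```python
import math

def cost_per_success(cost_per_run, success_rate):
    """Expected cost to obtain one successful result, assuming failed
    runs are independently retried until one succeeds."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate

def trials_to_detect(p1, p2, z=1.96):
    """Rough per-model sample size needed to distinguish two success
    rates (normal approximation, ~95% confidence on the difference)."""
    se2 = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z / (p1 - p2)) ** 2 * se2)

# Illustrative numbers, not benchmark results: a model that costs 20%
# more per run but succeeds more often is cheaper per success.
cheap = cost_per_success(1.00, 0.60)    # $1.00/run at 60% success
pricier = cost_per_success(1.20, 0.85)  # $1.20/run at 85% success
print(cheap, pricier)

# Detecting a 5-point improvement (60% -> 65%) needs hundreds of runs
# per model, which is why small benchmark suites rarely settle it.
print(trials_to_detect(0.60, 0.65))
```

The second function is the sample-size point in miniature: small quality deltas drown in binomial noise unless each model is run hundreds of times, which is exactly what makes cost-per-successful-task expensive to measure.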

Open-source and local models

  • Several participants are moving some work to open models (Qwen, Gemma, GLM) via local or third‑party hosting, citing cost control and predictable behavior.
  • Consensus: open models are improving fast and are “good enough” for many tasks, but still lag top proprietary models for complex, high‑stakes coding and agentic workflows.
  • Hardware requirements for truly frontier‑like local performance remain high; some warn local‑model enthusiasts are overselling current capabilities.

Workflow and model selection

  • Growing view that teams should “right‑size” models: small/cheap for rote implementation, larger models for planning, synthesis, and high‑risk tasks.
  • Others counter that misjudging task complexity causes wasted runs: weaker models make a mess that then must be redone with stronger ones.
  • Several note that human time (review, debugging, oversight) still dominates costs; until models are much more reliable, small price deltas per token are less important than accuracy and stability.
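The wasted-run objection in the middle bullet can be framed as expected value. A hedged sketch with invented costs, which simplifies by assuming the strong model always succeeds:

```python
def strong_first(strong_cost):
    """Always use the strong model (simplification: assume it succeeds)."""
    return strong_cost

def cheap_first(cheap_cost, cheap_success, strong_cost, cleanup_factor=1.0):
    """Try the cheap model first; on failure, redo with the strong model.
    cleanup_factor > 1 models the extra work of undoing the cheap
    model's mess before the strong model can take over."""
    return cheap_cost + (1 - cheap_success) * strong_cost * cleanup_factor

# Illustrative: cheap run $0.20, strong run $1.00.
# If the cheap model succeeds 80% of the time, cheap-first wins:
print(cheap_first(0.20, 0.80, 1.00))        # 0.40 < 1.00
# At 30% success with messy failures, strong-first is cheaper:
print(cheap_first(0.20, 0.30, 1.00, 1.3))   # 1.11 > 1.00
```

This captures the right-sizing debate: routing cheap pays off only when task difficulty is judged well, and even these dollar figures are dwarfed by the human review and debugging time the last bullet describes.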