Tokens are getting more expensive

Overusing SOTA Models vs Right-Sizing

  • Many argue we’re “smashing gnats with sledgehammers”: 7–32B and cheaper models are perfectly adequate for many tasks, especially structured workflows and basic coding/helpdesk tasks.
  • Some users already mix multiple models (e.g., 4–5 different ones in one app) to balance quality vs cost.
  • Others counter that average users don’t want to choose model size; they judge tools by worst-case failures, so they gravitate to frontier models.
  • There’s interest in orchestrations where an expensive “thinking” model delegates subtasks to cheaper ones, essentially a mixture-of-experts (MoE) pattern at the product level; parts of Claude Code and tools like Aider already approximate this.
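
The delegation idea above can be sketched as a tiny router. Everything here is illustrative: the tier names, prices, and the keyword heuristic are assumptions standing in for a real (possibly learned) router, not any actual provider API.

```python
# Hypothetical sketch of product-level routing: an orchestrator classifies
# each subtask and sends it to the cheapest tier expected to handle it.
# Tier names, prices, and the keyword heuristic are illustrative only.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, made-up numbers for illustration

CHEAP = ModelTier("small-7b", 0.0002)
MID = ModelTier("sonnet-class", 0.003)
FRONTIER = ModelTier("frontier-thinking", 0.015)

def route(task: str) -> ModelTier:
    """Crude keyword heuristic standing in for a learned router."""
    if any(k in task for k in ("prove", "architect", "debug race")):
        return FRONTIER
    if any(k in task for k in ("refactor", "summarize codebase")):
        return MID
    return CHEAP  # extraction, formatting, boilerplate answers

for task in ("format this JSON", "refactor the auth module", "prove invariant"):
    print(f"{task} -> {route(task).name}")
```

The point of the sketch is the cost asymmetry: if most subtasks land on the cheap tier, the expensive “thinking” model only pays for the calls that actually need it.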

Token Costs, Usage Patterns, and “Unlimited” Plans

  • Several commenters say tokens are getting cheaper per unit, but total usage is exploding—especially with coding agents that use huge contexts, repeated calls, and orchestration.
  • People report burning through tens of dollars in minutes/hours with tools like Claude Code and Gemini CLI, in contrast to very low spend for simple chat/API use.
  • Many dispute the article’s claim that “99% of demand” goes to the latest SOTA: usage data via OpenRouter shows cheaper but strong models (Claude Sonnet, Gemini Flash, Mistral) dominating volume; true “max” models are niche.
  • There’s consensus that “unlimited” flat plans get destroyed by a small number of heavy users (Zipf-like usage distribution). Anthropic-style time/weekly quotas are seen as more sustainable than truly unlimited plans.
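
The Zipf-like claim is easy to make concrete. A minimal sketch, assuming the k-th heaviest user consumes tokens proportional to 1/k (the numbers are purely illustrative):

```python
# Illustrative Zipf-like usage: the k-th heaviest user consumes tokens
# proportional to 1/k. Under a flat "unlimited" plan, a small slice of
# users then accounts for a disproportionate share of serving cost.
def zipf_usage(n_users: int, top_tokens: float = 1_000_000) -> list[float]:
    """Token usage per user, ranked heaviest first (usage proportional to 1/rank)."""
    return [top_tokens / rank for rank in range(1, n_users + 1)]

usage = zipf_usage(10_000)
total = sum(usage)
top_1pct = sum(usage[: len(usage) // 100])  # heaviest 1% of users
print(f"Top 1% of users consume {top_1pct / total:.0%} of all tokens")
```

With these assumptions the heaviest 1% of 10,000 users consume roughly half of all tokens, which is why flat plans tend to drift toward the weekly quotas described above.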

Metered Billing, Opaqueness, and AWS Analogies

  • Strong dislike of opaque, surprise metered billing (AI, AWS, GitHub Copilot). Users want: real-time token/$ counters, clear limits, and hard caps or auto-shutdown thresholds.
  • Others argue metered billing is fine for infra/B2B where usage is predictable and budgets exist, but it discourages everyday individual use because each request feels like a tiny financial decision.
  • Comparisons to utilities and telecom: predictable flat payments are psychologically easier, even if slightly overpriced.
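
The “hard cap or auto-shutdown” request above amounts to a spend meter in front of the API. A minimal sketch of that behavior; the price, cap, and `charge` interface are assumptions, not any real provider’s billing API:

```python
# Minimal sketch of the hard-cap behavior users ask for: a meter that
# tracks running spend and refuses further calls once a budget is hit.
# The price and interface are assumptions, not a real provider API.
class BudgetExceeded(RuntimeError):
    pass

class SpendMeter:
    def __init__(self, cap_usd: float, price_per_1k_tokens: float):
        self.cap = cap_usd
        self.price = price_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens: int) -> float:
        """Record a request's cost; raise instead of exceeding the cap."""
        cost = tokens / 1000 * self.price
        if self.spent + cost > self.cap:
            raise BudgetExceeded(f"cap ${self.cap:.2f} would be exceeded")
        self.spent += cost
        return self.spent

meter = SpendMeter(cap_usd=5.00, price_per_1k_tokens=0.01)
meter.charge(400_000)      # $4.00 spent so far
try:
    meter.charge(200_000)  # would push the total to $6.00
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Checking before charging (rather than after) is what turns surprise bills into a refused request, which is the real-time-counter-plus-hard-cap experience commenters want.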

Local/Open Models and Edge Compute

  • Some participants avoid subscriptions by using open-source frontends with direct API billing or by running local models on GPUs/Cloud Run, trading throughput and quality for predictable costs and privacy.
  • There’s interest in “edge-first” architectures and specialized local models to avoid cloud token economics.

Meta: Writing Style and Hype

  • A large subthread debates the author’s all-lowercase style: some see it as lazy or unreadable; others as a generational or anti-LLM aesthetic.
  • Several commenters view the article as “vibes-based” and speculative, noting real serving costs are unknown and current discourse is driven more by hype than hard unit economics.