Tokens are getting more expensive

Overusing SOTA Models vs Right-Sizing

  • Many argue we’re “smashing gnats with sledgehammers”: 7–32B and cheaper models are perfectly adequate for many tasks, especially structured workflows and basic coding/helpdesk tasks.
  • Some users already mix multiple models (e.g., 4–5 different ones in one app) to balance quality vs cost.
  • Others counter that average users don’t want to choose model size; they judge tools by worst-case failures, so they gravitate to frontier models.
  • There’s interest in orchestrations where an expensive “thinking” model delegates subtasks to cheaper ones, essentially a mixture-of-experts (MoE) pattern at the product level; parts of Claude Code and tools like Aider already approximate this.
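
The delegation idea above can be sketched as a tiny router. Everything here is illustrative: the tier names, prices, and the keyword heuristic are assumptions standing in for a real (possibly learned) router, not any actual provider API.

```python
# Hypothetical sketch of product-level routing: an orchestrator classifies
# each subtask and sends it to the cheapest tier expected to handle it.
# Tier names, prices, and the keyword heuristic are illustrative only.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, made-up numbers for illustration

CHEAP = ModelTier("small-7b", 0.0002)
MID = ModelTier("sonnet-class", 0.003)
FRONTIER = ModelTier("frontier-thinking", 0.015)

def route(task: str) -> ModelTier:
    """Crude keyword heuristic standing in for a learned router."""
    if any(k in task for k in ("prove", "architect", "debug race")):
        return FRONTIER
    if any(k in task for k in ("refactor", "summarize codebase")):
        return MID
    return CHEAP  # extraction, formatting, boilerplate answers

for task in ("format this JSON", "refactor the auth module", "prove invariant"):
    print(f"{task} -> {route(task).name}")
```

The point of the sketch is the cost asymmetry: if most subtasks land on the cheap tier, the expensive “thinking” model only pays for the calls that actually need it.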

Token Costs, Usage Patterns, and “Unlimited” Plans

  • Several commenters say tokens are getting cheaper per unit, but total usage is exploding—especially with coding agents that use huge contexts, repeated calls, and orchestration.
  • People report burning through tens of dollars in minutes/hours with tools like Claude Code and Gemini CLI, in contrast to very low spend for simple chat/API use.
  • Many dispute the article’s claim that “99% of demand” goes to the latest SOTA: usage data via OpenRouter shows cheaper but strong models (Claude Sonnet, Gemini Flash, Mistral) dominating volume; true “max” models are niche.
  • There’s consensus that “unlimited” flat plans get destroyed by a small number of heavy users (Zipf-like usage distribution). Anthropic-style time/weekly quotas are seen as more sustainable than truly unlimited plans.
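
The Zipf-like claim is easy to make concrete. A minimal sketch, assuming the k-th heaviest user consumes tokens proportional to 1/k (the numbers are purely illustrative):

```python
# Illustrative Zipf-like usage: the k-th heaviest user consumes tokens
# proportional to 1/k. Under a flat "unlimited" plan, a small slice of
# users then accounts for a disproportionate share of serving cost.
def zipf_usage(n_users: int, top_tokens: float = 1_000_000) -> list[float]:
    """Token usage per user, ranked heaviest first (usage proportional to 1/rank)."""
    return [top_tokens / rank for rank in range(1, n_users + 1)]

usage = zipf_usage(10_000)
total = sum(usage)
top_1pct = sum(usage[: len(usage) // 100])  # heaviest 1% of users
print(f"Top 1% of users consume {top_1pct / total:.0%} of all tokens")
```

With these assumptions the heaviest 1% of 10,000 users consume roughly half of all tokens, which is why flat plans tend to drift toward the weekly quotas described above.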

Metered Billing, Opaqueness, and AWS Analogies

  • Strong dislike of opaque, surprise metered billing (AI, AWS, GitHub Copilot). Users want: real-time token/$ counters, clear limits, and hard caps or auto-shutdown thresholds.
  • Others argue metered billing is fine for infra/B2B where usage is predictable and budgets exist, but it discourages everyday individual use because each request feels like a tiny financial decision.
  • Comparisons to utilities and telecom: predictable flat payments are psychologically easier, even if slightly overpriced.
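
The “hard cap or auto-shutdown” request above amounts to a spend meter in front of the API. A minimal sketch of that behavior; the price, cap, and `charge` interface are assumptions, not any real provider’s billing API:

```python
# Minimal sketch of the hard-cap behavior users ask for: a meter that
# tracks running spend and refuses further calls once a budget is hit.
# The price and interface are assumptions, not a real provider API.
class BudgetExceeded(RuntimeError):
    pass

class SpendMeter:
    def __init__(self, cap_usd: float, price_per_1k_tokens: float):
        self.cap = cap_usd
        self.price = price_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens: int) -> float:
        """Record a request's cost; raise instead of exceeding the cap."""
        cost = tokens / 1000 * self.price
        if self.spent + cost > self.cap:
            raise BudgetExceeded(f"cap ${self.cap:.2f} would be exceeded")
        self.spent += cost
        return self.spent

meter = SpendMeter(cap_usd=5.00, price_per_1k_tokens=0.01)
meter.charge(400_000)      # $4.00 spent so far
try:
    meter.charge(200_000)  # would push the total to $6.00
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Checking before charging (rather than after) is what turns surprise bills into a refused request, which is the real-time-counter-plus-hard-cap experience commenters want.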

Local/Open Models and Edge Compute

  • Some participants avoid subscriptions by using open-source frontends with direct API billing or by running local models on GPUs/Cloud Run, trading throughput and quality for predictable costs and privacy.
  • There’s interest in “edge-first” architectures and specialized local models to avoid cloud token economics.

Meta: Writing Style and Hype

  • A large subthread debates the author’s all-lowercase style: some see it as lazy or unreadable; others as a generational or anti-LLM aesthetic.
  • Several commenters view the article as “vibes-based” and speculative, noting real serving costs are unknown and current discourse is driven more by hype than hard unit economics.