Tokens are getting more expensive
Overusing SOTA Models vs Right-Sizing
- Many argue we’re “smashing gnats with sledgehammers”: 7–32B and cheaper models are perfectly adequate for many tasks, especially structured workflows and basic coding/helpdesk tasks.
- Some users already mix multiple models (e.g., 4–5 different ones in one app) to balance quality vs cost.
- Others counter that average users don’t want to choose model size; they judge tools by worst-case failures, so they gravitate to frontier models.
- There’s interest in orchestrations where an expensive “thinking” model delegates subtasks to cheaper ones—essentially MoE at the product level; parts of Claude Code and tools like Aider already approximate this.
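The delegation pattern described above can be sketched in a few lines. This is a hypothetical illustration, not any product's actual router: the model names, prices, and the keyword-based routing heuristic are all assumptions.

```python
# Sketch of product-level routing: an expensive "planner" model handles
# hard tasks, a cheap "worker" model handles routine ones.
# Model names and per-token prices are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    usd_per_1m_tokens: float

PLANNER = Model("frontier-thinking", 15.00)  # assumed frontier pricing
WORKER = Model("small-7b", 0.20)             # assumed small-model pricing

def route(task: str) -> Model:
    """Send only tasks that look genuinely hard to the frontier model."""
    hard_markers = ("architecture", "proof", "ambiguous")
    if any(m in task.lower() for m in hard_markers):
        return PLANNER
    return WORKER

def cost(model: Model, tokens: int) -> float:
    return model.usd_per_1m_tokens * tokens / 1_000_000

for t in ["rename variable", "write unit test", "design the architecture"]:
    m = route(t)
    print(f"{t} -> {m.name} (${cost(m, 2_000):.4f} for 2k tokens)")
```

In a real system the routing decision would itself often be made by a small classifier or by the planner model; the string-matching heuristic here only stands in for that step.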
Token Costs, Usage Patterns, and “Unlimited” Plans
- Several commenters say tokens are getting cheaper per unit, but total usage is exploding—especially with coding agents that use huge contexts, repeated calls, and orchestration.
- People report burning through tens of dollars in minutes/hours with tools like Claude Code and Gemini CLI, in contrast to very low spend for simple chat/API use.
- Many dispute the article’s claim that “99% of demand” goes to the latest SOTA models: OpenRouter usage data show cheaper but still strong models (Claude Sonnet, Gemini Flash, Mistral) dominating volume, while true “max”-tier models remain niche.
- There’s consensus that “unlimited” flat plans get destroyed by a small number of heavy users (Zipf-like usage). Anthropic-style daily/weekly quotas are seen as more sustainable than true unlimited.
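The Zipf-like usage argument can be made concrete with a toy simulation. The distribution parameters below (rank-r user consuming 1/r as many tokens as the heaviest user, 10,000 subscribers, 1B tokens for the top user) are illustrative assumptions, not measured data.

```python
# Sketch: why flat "unlimited" plans fail under Zipf-like usage.
# Assumes the rank-r user consumes top_user_tokens / r (pure Zipf, s=1).
def zipf_usage(n_users: int, top_user_tokens: float) -> list[float]:
    return [top_user_tokens / r for r in range(1, n_users + 1)]

usage = zipf_usage(10_000, 1e9)  # assumed: heaviest user burns 1B tokens/month
total = sum(usage)
top_1pct = sum(usage[:100])  # the 100 heaviest of 10,000 users
print(f"top 1% of users consume {top_1pct / total:.0%} of all tokens")
```

Under these assumptions the top 1% of users account for over half of all tokens, which is why a quota that only the heaviest tail ever hits (the Anthropic-style approach) prices most subscribers as if the plan were unlimited while capping the users who would otherwise sink it.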
Metered Billing, Opaqueness, and AWS Analogies
- Strong dislike of opaque, surprise metered billing (AI, AWS, GitHub Copilot). Users want: real-time token/$ counters, clear limits, and hard caps or auto-shutdown thresholds.
- Others argue metered billing is fine for infra/B2B where usage is predictable and budgets exist, but it discourages everyday individual use because each request feels like a tiny financial decision.
- Comparisons to utilities and telecom: predictable flat payments are psychologically easier, even if slightly overpriced.
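The hard-cap behavior commenters ask for can be sketched as a wrapper around any provider client. Everything here is hypothetical: `send_fn`, the pricing, and the rough words-to-tokens estimate stand in for a real SDK and real rates.

```python
# Sketch of a metered client with a live spend counter and a hard cap.
# send_fn is a hypothetical stand-in: it takes a prompt and returns
# (reply_text, tokens_used). Pricing numbers are assumptions.
class BudgetExceeded(RuntimeError):
    pass

class MeteredClient:
    def __init__(self, send_fn, usd_per_1k_tokens: float, hard_cap_usd: float):
        self.send_fn = send_fn
        self.rate = usd_per_1k_tokens
        self.cap = hard_cap_usd
        self.spent = 0.0  # running real-time counter

    def send(self, prompt: str) -> str:
        # Pre-flight check with a crude token estimate (~0.75 words/token).
        est_cost = len(prompt.split()) / 0.75 / 1000 * self.rate
        if self.spent + est_cost > self.cap:
            raise BudgetExceeded(f"would exceed ${self.cap:.2f} hard cap")
        reply, tokens_used = self.send_fn(prompt)
        self.spent += tokens_used / 1000 * self.rate  # charge actual usage
        return reply
```

A production version would also need the estimate to bound the *response* length, since the reply usually dominates cost; the pre-flight check here only illustrates the auto-shutdown idea.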
Local/Open Models and Edge Compute
- Some participants avoid subscriptions by using open-source frontends with direct API billing or by running local models on GPUs/Cloud Run, trading throughput and quality for predictable costs and privacy.
- There’s interest in “edge-first” architectures and specialized local models to avoid cloud token economics.
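The local-vs-cloud trade above is ultimately amortization arithmetic. A back-of-envelope sketch, with every number an assumption chosen for illustration:

```python
# Back-of-envelope: amortized local GPU cost vs per-token API pricing.
# All figures are assumptions: $2,000 GPU over 24 months, $30/month
# power, 50M tokens/month of throughput, $0.20/1M for a hosted small model.
def local_cost_per_1m(gpu_usd: float, lifetime_months: int,
                      power_usd_month: float, tokens_per_month: float) -> float:
    monthly = gpu_usd / lifetime_months + power_usd_month
    return monthly / tokens_per_month * 1_000_000

local = local_cost_per_1m(2000, 24, 30, 50_000_000)
api = 0.20
print(f"local: ${local:.2f}/1M tokens vs hosted API: ${api:.2f}/1M tokens")
```

Under these assumptions the hosted small model is cheaper per token; the case for local, as the commenters note, rests on flat predictable spend and privacy rather than raw unit cost, and the math shifts with utilization and model quality requirements.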
Meta: Writing Style and Hype
- A large subthread debates the author’s all-lowercase style: some see it as lazy or unreadable; others as a generational or anti-LLM aesthetic.
- Several commenters view the article as “vibes-based” and speculative, noting real serving costs are unknown and current discourse is driven more by hype than hard unit economics.