2025-07-04

Everything around LLMs is still magical and wishful thinking

Crypto vs. LLMs: Similar Hype, Different Substance

Some see “it’s crypto all over again”: heavy marketing, exaggerated claims, and a social environment where criticism is dismissed.
Others argue the analogy is shallow: crypto never found broad legal-economy uses beyond censorship‑resistant payments (though that’s life-or-death useful for some), whereas LLMs already have many mainstream, non-speculative applications.
A recurring point: both fields suffer from dishonest or naive overpromising, which drives away people who might benefit from a sober understanding.

Real-World Utility Reports

Strong positive anecdotes:
- Classifying invoices, data science tasks, PCAP analysis, transcribing and mining thousands of calls, summarizing large text corpora, drafting legal documents, research assistance, and brainstorming.
- Code help: debugging, boilerplate, refactors, unit tests, SQL, “rubber-duck” architecture discussions; some claim 2–5x personal output, a few claim “LLMs write nearly all my production code” with human review.
Many treat LLMs as high-level languages or “thinking partners” rather than autonomous agents.

Limits, Failure Modes, and Trust

Frequent failure modes: hallucinated APIs, protocols, citations, laws; ignoring project docs; forgetting instructions; weak math; poor performance in niche stacks or complex architecture; brittle behavior across sessions.
Strong warnings against use for mission-critical code, safety‑critical systems, or unsupervised legal filings; multiple external examples of AI-caused legal errors are cited.
Several stress that LLMs can make users feel productive while quietly injecting subtle bugs or conceptual slop.

Productivity Claims and Measurement Problems

Practitioners report modest average gains (often ~10–30%) rather than “10x”, due to non-coding overheads and review costs.
Management fixates on headline multipliers; internal “success” metrics are often narrow or methodologically weak.
The article’s main critique, echoed by some commenters: sweeping claims (“Claude writes most of X’s code”, “I’m 5x everyone else”) are anecdotal, unverifiable, and lack crucial context (domain, baseline skill, quality standards, review rigor).

Economics, Cost, and Open Models

Debate over sustainability: huge training spend vs currently limited impact on GDP and heavy VC subsidies.
Open-weight models (e.g., Llama family, Qwen) are seen as a check on API pricing and vendor moats; legal attacks on them could shift power back to a few incumbents.
Many expect strong local models on consumer hardware to be “good enough” for most work, even if bleeding edge remains centralized.

Workflows, Methodology, and “Prompt Engineering”

Effective users describe careful, iterative workflows: targeted prompts into known code regions, explicit test planning, checklist-driven agents, and strict human auditing.
Others find that writing robust prompts and then verifying output can take as long as doing the work manually, especially for novel problems or messy legacy systems.
General consensus: LLMs amplify good engineers and good processes; in weak contexts they mainly accelerate the production of low-quality output.

Broader Impacts and Open Questions

Concern about: erosion of junior roles and skill pipelines, AI-generated “slop” in codebases and documents, overhype driving bad management decisions and premature layoffs.
Some foresee large efficiency gains in “manual data pipelining” and back-office work, with humans shifting toward verification and liability-bearing roles.
Safety issues like prompt injection and limited context are flagged as fundamental, under-addressed constraints.
Many commenters reject both “magic” and “useless” extremes, calling for rigorous, domain-specific evaluation rather than vibes-based extrapolation.

Related topics