The Unreliability of LLMs and What Lies Ahead

Perceived Capabilities and Hype

  • Many see LLMs as doing “more of what computers already did”: pattern matching, data analysis, boilerplate generation, rather than some magical new form of intelligence.
  • Others point out qualitatively new-feeling abilities (philosophical framing of news, reasoning about images, bespoke code/library suggestions) but agree it’s still statistical text/data processing.
  • Strong skepticism that current LLMs justify their valuations or the “Cyber Christ” narrative, though most agree they’ll remain a useful technology.

Reliability, Hallucinations, and “Lying”

  • Core complaint: models confidently output plausible but false information and fabricated rationales; in critical work this is indistinguishable from lying.
  • Several argue “lying” and “hallucination” are misleading anthropomorphic metaphors: the model has no self-knowledge or grounding, just produces likely text.
  • RLHF and other human-feedback schemes may inadvertently select for outputs that are persuasively wrong, optimizing for deception-like behavior (a toy illustration follows this list).
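
The selection-pressure point can be made concrete with a small simulation. This is a toy sketch, not how any real RLHF pipeline is implemented: it assumes a hypothetical rater who can judge persuasiveness but cannot verify facts, and all candidate answers and scores below are invented for illustration.

```python
# Toy model of the selection-pressure argument above: if raters reward
# confident-sounding text rather than verified correctness, preference
# training drifts toward persuasive wrong answers.
# All candidates and scores here are hypothetical.
import random

random.seed(0)

# Each candidate answer has a ground-truth correctness flag and a
# "confidence" score (how persuasive it sounds to a rater).
candidates = [
    {"text": "hedged, correct answer",    "correct": True,  "confidence": 0.4},
    {"text": "confident, correct answer", "correct": True,  "confidence": 0.7},
    {"text": "confident, wrong answer",   "correct": False, "confidence": 0.9},
]

def rater_prefers(a, b):
    """A rater who cannot verify facts picks the more persuasive answer."""
    return a if a["confidence"] >= b["confidence"] else b

# Simulate many pairwise comparisons, as in preference-based fine-tuning.
wins = {c["text"]: 0 for c in candidates}
for _ in range(10_000):
    a, b = random.sample(candidates, 2)
    wins[rater_prefers(a, b)["text"]] += 1

for text, n in sorted(wins.items(), key=lambda kv: -kv[1]):
    print(f"{n:5d}  {text}")
# The confident wrong answer collects the most "reward" even though it
# is the only incorrect one: persuasiveness, not truth, gets selected.
```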

Divergent User Experiences

  • One camp: “mostly right enough” for coding, writing, brainstorming, learning; willing to live with uncertainty and verify when needed.
  • Other camp: finds outputs “mostly wrong in subtle ways,” so that reviewing them costs more than doing the work from scratch.
  • This divide is framed as differing expectations, tolerance for uncertainty, domain expertise, and even personality.

Software Development Use Cases

  • Positive reports: big time savings on glue code, scripts, YAML transforms, CI configs, documentation, small DB queries, and unit tests, especially in mainstream languages (see the transform sketch after this list).
  • Critics say productivity gains are overstated: time shifts from typing to careful review, especially for large changes or legacy systems.
  • Concerns about “vibe-coded” codebases, security flaws, and future maintenance of LLM-generated sludge.
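
To ground the “glue code” category, here is the kind of small, easily reviewed transform commenters describe delegating. The file layout and field names are hypothetical; only the standard PyYAML calls (`yaml.safe_load` / `yaml.safe_dump`) are real.

```python
# Hypothetical example of routine "glue" work: reshape a YAML list of
# services into a dict keyed by name, dropping disabled entries.
import yaml  # PyYAML

SRC = """
services:
  - name: api
    port: 8080
    enabled: true
  - name: worker
    port: 9090
    enabled: false
"""

doc = yaml.safe_load(SRC)

# services list -> {name: {port: ...}} mapping, enabled entries only
by_name = {
    svc["name"]: {"port": svc["port"]}
    for svc in doc["services"]
    if svc.get("enabled", True)
}

print(yaml.safe_dump({"services": by_name}, sort_keys=False))
# services:
#   api:
#     port: 8080
```

A few lines like this are trivial to verify by eye, which is exactly why this camp reports net time savings: the review cost stays far below the authoring cost.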

High-Stakes vs Low-Stakes Applications

  • Widely accepted for low-consequence tasks: vacation ideation, travel “vibe checks,” children’s books, vanity content, internal summaries.
  • Strong pushback on using LLMs in law, government benefits, safety-critical engineering, or financial analysis where “mostly right” is unacceptable.

Search, Summarization, and Knowledge Quality

  • LLM-based summaries in search are praised for convenience but criticized for factual inversions and for siphoning traffic away from original sources.
  • Worry that powerful “bullshit machines” exploit people’s Gell-Mann-amnesia-like tendency to trust fluent text outside their own expertise.

Scientific/Technical Domains and Causality

  • Scientists report that even with tools and citations, models conflate correlated concepts, misgroup topics, and mishandle basic domain math.
  • Multiple comments argue that genuine progress requires causal/world models and rigorous evaluation theory, not just bigger LLMs or prompt tricks (a toy confounding example follows).
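
The correlation-vs-causation worry can be illustrated with a toy structural model, invented for this note: a hidden confounder Z drives both X and Y, so a purely correlational fit predicts Y from X well on observational data yet fails the moment X is set by intervention.

```python
# Toy confounding example behind the "causal models" argument above:
# X never causes Y, but both are driven by hidden Z, so they correlate.
# All distributions and numbers are illustrative.
import random

random.seed(0)

def observe():
    """Observational world: hidden Z drives both X and Y."""
    z = random.gauss(0, 1)
    x = z + random.gauss(0, 0.1)
    y = z + random.gauss(0, 0.1)
    return x, y

data = [observe() for _ in range(10_000)]

# Least-squares slope for y ~ a*x on observational data (zero-mean case).
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
a = sxy / sxx
print(f"observational slope: {a:.2f}")  # ~0.99: X "predicts" Y well

def intervene(x_forced):
    """Interventional world, do(X=x): Y still depends only on Z."""
    z = random.gauss(0, 1)
    return z + random.gauss(0, 0.1)  # Y ignores the forced X entirely

ys = [intervene(3.0) for _ in range(10_000)]
print(f"predicted Y under do(X=3): {a * 3.0:.2f}")         # ~2.97
print(f"actual    Y under do(X=3): {sum(ys) / len(ys):.2f}")  # ~0.00
```

A predictor fit purely on observed correlations answers the interventional question badly, which is the gap these commenters argue bigger LLMs alone do not close.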