The Unreliability of LLMs and What Lies Ahead

Perceived Capabilities and Hype

  • Many see LLMs as doing “more of what computers already did”: pattern matching, data analysis, boilerplate generation, rather than some magical new form of intelligence.
  • Others point out qualitatively new-feeling abilities (philosophical framing of news, reasoning about images, bespoke code/library suggestions) but agree it’s still statistical text/data processing.
  • Strong skepticism that current LLMs justify their valuations or the “Cyber Christ” narrative, though most agree they’ll remain a useful technology.

Reliability, Hallucinations, and “Lying”

  • Core complaint: models confidently output plausible but false information and fabricated rationales; in critical work this is indistinguishable from lying.
  • Several argue “lying” and “hallucination” are misleading anthropomorphic metaphors: the model has no self-knowledge or grounding, just produces likely text.
  • RLHF and other human-feedback schemes may inadvertently select for outputs that are persuasively wrong, optimizing for deception-like behavior (a toy illustration follows this list).
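
The selection-pressure point can be made concrete with a small simulation. This is a toy sketch, not how any real RLHF pipeline is implemented: it assumes a hypothetical rater who can judge persuasiveness but cannot verify facts, and all candidate answers and scores below are invented for illustration.

```python
# Toy model of the selection-pressure argument above: if raters reward
# confident-sounding text rather than verified correctness, preference
# training drifts toward persuasive wrong answers.
# All candidates and scores here are hypothetical.
import random

random.seed(0)

# Each candidate answer has a ground-truth correctness flag and a
# "confidence" score (how persuasive it sounds to a rater).
candidates = [
    {"text": "hedged, correct answer",    "correct": True,  "confidence": 0.4},
    {"text": "confident, correct answer", "correct": True,  "confidence": 0.7},
    {"text": "confident, wrong answer",   "correct": False, "confidence": 0.9},
]

def rater_prefers(a, b):
    """A rater who cannot verify facts picks the more persuasive answer."""
    return a if a["confidence"] >= b["confidence"] else b

# Simulate many pairwise comparisons, as in preference-based fine-tuning.
wins = {c["text"]: 0 for c in candidates}
for _ in range(10_000):
    a, b = random.sample(candidates, 2)
    wins[rater_prefers(a, b)["text"]] += 1

for text, n in sorted(wins.items(), key=lambda kv: -kv[1]):
    print(f"{n:5d}  {text}")
# The confident wrong answer collects the most "reward" even though it
# is the only incorrect one: persuasiveness, not truth, gets selected.
```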

Divergent User Experiences

  • One camp: “mostly right enough” for coding, writing, brainstorming, learning; willing to live with uncertainty and verify when needed.
  • Other camp: finds outputs “mostly wrong in subtle ways,” so that reviewing them costs more than doing the work from scratch.
  • This divide is framed as differing expectations, tolerance for uncertainty, domain expertise, and even personality.

Software Development Use Cases

  • Positive reports: big time savings on glue code, scripts, YAML transforms, CI configs, documentation, small DB queries, and unit tests, especially in mainstream languages (see the transform sketch after this list).
  • Critics say productivity gains are overstated: time shifts from typing to careful review, especially for large changes or legacy systems.
  • Concerns about “vibe-coded” codebases, security flaws, and future maintenance of LLM-generated sludge.
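
To ground the “glue code” category, here is the kind of small, easily reviewed transform commenters describe delegating. The file layout and field names are hypothetical; only the standard PyYAML calls (`yaml.safe_load` / `yaml.safe_dump`) are real.

```python
# Hypothetical example of routine "glue" work: reshape a YAML list of
# services into a dict keyed by name, dropping disabled entries.
import yaml  # PyYAML

SRC = """
services:
  - name: api
    port: 8080
    enabled: true
  - name: worker
    port: 9090
    enabled: false
"""

doc = yaml.safe_load(SRC)

# services list -> {name: {port: ...}} mapping, enabled entries only
by_name = {
    svc["name"]: {"port": svc["port"]}
    for svc in doc["services"]
    if svc.get("enabled", True)
}

print(yaml.safe_dump({"services": by_name}, sort_keys=False))
# services:
#   api:
#     port: 8080
```

A few lines like this are trivial to verify by eye, which is exactly why this camp reports net time savings: the review cost stays far below the authoring cost.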

High-Stakes vs Low-Stakes Applications

  • Widely accepted for low-consequence tasks: vacation ideation, travel “vibe checks,” children’s books, vanity content, internal summaries.
  • Strong pushback on using LLMs in law, government benefits, safety-critical engineering, or financial analysis where “mostly right” is unacceptable.

Search, Summarization, and Knowledge Quality

  • LLM-based summaries in search are praised for convenience but criticized for factual inversions and for siphoning traffic away from original sources.
  • Worry that powerful “bullshit machines” exploit people’s Gell-Mann-amnesia-like tendency to trust fluent text outside their own expertise.

Scientific/Technical Domains and Causality

  • Scientists report that even with tools and citations, models conflate correlated concepts, misgroup topics, and mishandle basic domain math.
  • Multiple comments argue that genuine progress requires causal/world models and rigorous evaluation theory, not just bigger LLMs or prompt tricks (a toy confounding example follows).
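
The correlation-vs-causation worry can be illustrated with a toy structural model, invented for this note: a hidden confounder Z drives both X and Y, so a purely correlational fit predicts Y from X well on observational data yet fails the moment X is set by intervention.

```python
# Toy confounding example behind the "causal models" argument above:
# X never causes Y, but both are driven by hidden Z, so they correlate.
# All distributions and numbers are illustrative.
import random

random.seed(0)

def observe():
    """Observational world: hidden Z drives both X and Y."""
    z = random.gauss(0, 1)
    x = z + random.gauss(0, 0.1)
    y = z + random.gauss(0, 0.1)
    return x, y

data = [observe() for _ in range(10_000)]

# Least-squares slope for y ~ a*x on observational data (zero-mean case).
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
a = sxy / sxx
print(f"observational slope: {a:.2f}")  # ~0.99: X "predicts" Y well

def intervene(x_forced):
    """Interventional world, do(X=x): Y still depends only on Z."""
    z = random.gauss(0, 1)
    return z + random.gauss(0, 0.1)  # Y ignores the forced X entirely

ys = [intervene(3.0) for _ in range(10_000)]
print(f"predicted Y under do(X=3): {a * 3.0:.2f}")         # ~2.97
print(f"actual    Y under do(X=3): {sum(ys) / len(ys):.2f}")  # ~0.00
```

A predictor fit purely on observed correlations answers the interventional question badly, which is the gap these commenters argue bigger LLMs alone do not close.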