Hallucinations in code are the least dangerous form of LLM mistakes

Value and limits of LLM‑generated code

  • Several commenters report successfully building non‑trivial systems (DSLs, web servers, lab scripts, SaaS scaffolding) with LLMs, especially when constrained by familiar stacks and libraries.
  • Others say LLMs are great for boilerplate, unit tests, demos, or “toy” projects but break down on large, evolving codebases, complex C/C++ APIs, or subtle concurrency and memory issues.
  • Some find LLM codebases depressing or uninteresting to study, feeling they remove the “romance” and learning value of human‑written open source.

What counts as a hallucination?

  • Disagreement over terminology: some restrict “hallucination” to invented APIs/facts; others see any wrong output (including logic bugs) as hallucination; some argue the term is misleading anthropomorphism.
  • Many note that hallucinated methods are often the least dangerous issue, since an invented API fails loudly the first time the code runs; far worse are plausible but wrong logic, mis-specified behavior, or silently ignored edge cases.
  • Examples: incorrect ZeroMQ memory handling, wrong lexing line numbers, silent allocation failures, misinterpreted sorting logic, missing features after refactors, or misdescribed behavior in comments.
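The distinction above can be sketched in a few lines (function names invented for illustration): a hallucinated API announces itself with an exception, while a plausible logic bug runs cleanly and returns an answer that merely looks right.

```python
def total_hallucinated(items):
    # Hallucinated API: Python lists have no .sum() method, so this
    # raises AttributeError the first time it runs -- loud and easy to fix.
    return items.sum()

def sum_last_n(items, n):
    # Plausible but wrong: the slice is off by one (should be items[-n:]),
    # yet the code runs without error and returns a reasonable-looking number.
    return sum(items[-(n - 1):])
```

Here `sum_last_n([1, 2, 3, 4], 3)` returns 7 instead of the intended 9, and nothing in the runtime flags it; that is the class of mistake reviewers say survives casual inspection.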

Code review, testing, and trust

  • Strong pushback on the claim that struggling to review LLM output means “you’re bad at reviewing code”: reviewers stress that reading unfamiliar code (especially without access to a human author’s intent) is intrinsically slow and hard.
  • Multiple people liken LLM-heavy workflows to “full self‑driving, but keep your hands on the wheel”: over time, humans will stop truly supervising, which is when rare but severe failures matter.
  • Consensus that tests can’t prove correctness, only expose some errors; high‑risk code still requires reasoning about requirements, invariants, and race conditions.
  • Concern that LLM‑written tests may simply encode the same misunderstandings as the implementation.
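A minimal sketch of that concern (the example is invented): if the test is generated from the same mental model as the implementation, the suite stays green while both are wrong.

```python
def median(values):
    # Misunderstanding: always take the middle element after sorting,
    # ignoring the even-length case (which should average the two middles).
    return sorted(values)[len(values) // 2]

def test_median():
    # A test derived from the same misunderstanding asserts the buggy
    # behavior, so it passes and lends false confidence.
    assert median([1, 2, 3]) == 2      # odd case happens to be correct
    assert median([1, 2, 3, 4]) == 3   # wrong: the true median is 2.5
```

Running `test_median()` raises nothing, which is exactly the failure mode: the tests expose no error because they encode the error.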

Safety, persuasion, and broader risks

  • Several argue hallucinations in code are minor compared to risks from persuasive chatbots encouraging self‑harm or violence; cite real incidents and worry about increasingly “people‑pleasing” models.
  • Debate over whether future highly persuasive models could “own” users cognitively vs. claims this repeats old moral panics about books, films, and video games.
  • Some suggest restricting access or adding “safety buffers” between powerful models and end users; others see this as censorship and corporate moat‑building.

Maintainability, architecture, and ecosystem effects

  • Common complaint: LLMs produce inconsistent patterns, over‑engineering, weird abstractions, repeated CSS/styles, and poor error handling—harder to maintain than hand‑written code.
  • Worry that devs will choose “boring” or popular tech purely because models know it, reducing innovation and pushing ecosystems toward what’s well‑represented in training data.
  • Security concerns include prompt‑driven supply‑chain attacks via hallucinated packages and the ease of mass‑producing superficially good but subtly vulnerable code.
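One defensive sketch against the hallucinated-package attack (the allowlist contents and function name are invented): flag any dependency not on a vetted list before installing, since attackers can pre-register packages under names models commonly invent.

```python
# Example allowlist; a real one would come from a vetted internal registry.
KNOWN_GOOD = {"requests", "numpy", "flask"}

def suspicious_dependencies(requirements_text):
    """Return dependency names not on the allowlist."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Keep only the package name, dropping version pins like ==1.2 or >=1.0
        name = line.split("==")[0].split(">=")[0].strip().lower()
        if name not in KNOWN_GOOD:
            flagged.append(name)
    return flagged
```

For instance, `suspicious_dependencies("requests==2.31\nreqeusts-pro")` flags only the typosquat-style name `reqeusts-pro`, the kind of plausible-but-nonexistent package a model might emit.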