Making o1, o3, and Sonnet 3.7 hallucinate for everyone

Hallucination vs. Regurgitation

  • Debate over whether the example is a “hallucination” or just reproducing a wrong pattern from training data.
  • One view: if the incorrect syntax appeared online, the model is regurgitating contaminated or adversarial data, not inventing facts.
  • Counterview: even with perfectly correct training data, next-token models will still produce incorrect combinations (the way a Markov chain trained on “roses are red” and “violets are blue and deep” can emit “roses are blue and deep”), so hallucinations are inherent.
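
The counterview can be made concrete with a toy bigram model. This is only a sketch of the recombination mechanism, not a claim about how LLMs actually work internally: trained on two individually correct sentences, the chain can still emit a blend that appeared in neither.

```python
# Each training sentence is individually correct.
sentences = ["roses are red", "violets are blue and deep"]

# Bigram table: {"roses": ["are"], "are": ["red", "blue"], ...}
chain = {}
for s in sentences:
    words = s.split()
    for prev, nxt in zip(words, words[1:]):
        chain.setdefault(prev, []).append(nxt)

def generate(word):
    """Yield every word sequence the chain can produce from `word`."""
    if word not in chain:
        yield [word]
        return
    for nxt in chain[word]:
        for rest in generate(nxt):
            yield [word, *rest]

outputs = [" ".join(seq) for seq in generate("roses")]
print(outputs)  # ['roses are red', 'roses are blue and deep']
```

The second output never occurred in the data; it falls out of stitching together locally valid transitions, which is the sense in which “hallucination” survives even perfect training data.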

Nature and Inevitability of Hallucinations

  • Some argue everything an LLM outputs is a kind of hallucination; we only call it that when it’s wrong.
  • Others maintain “hallucination” is meaningful: it’s when the model confidently presents non-facts (invented APIs, syntax, papers) as true.
  • Clarification: in the case at hand, a blog post in the training data merely proposed a syntax/interface; the model treats that proposal as established reality.

Training Data, Contamination, and Data Poisoning

  • Concern that anyone can publish bogus syntax or backdoored patterns that get scraped and later reproduced by models.
  • Suggestion that adversaries could flood the web with plausible-looking but wrong tutorials, especially auto-generated by AI.
  • Question of why niche, wrong patterns sometimes dominate over vast amounts of correct code; proposed reasons include prompt similarity (“How do I…”) and niche topics being underweighted in training.

LLMs for Coding: Usefulness and Pitfalls

  • Many report LLMs inventing language features, CLI flags, library APIs, and OpenAPI rules, especially for niche or less-documented topics.
  • Users see them as helpful for boilerplate, scaffolding, refactors, and mainstream tech stacks, but dangerous as a primary research tool.
  • Perceived answer quality often inversely correlates with user expertise; novices are more easily impressed and misled.

Prompting, Context, and Tooling

  • Strong emphasis on detailed system prompts, specifying language versions, and feeding compiler/LSP errors back to the model.
  • Some find more context improves results; others see more hallucinations or distraction beyond a “sweet spot.”
  • Tools that integrate codebase search, static analysis, and iterative compilation are seen as more promising than raw chat.

Testing, Safety, and Reliability

  • Argument that hallucinations in code are “least dangerous” because compilation/tests expose them—countered by examples of subtle, test-passing bugs.
  • Consensus that LLM output should be treated as untrusted input: validate, test, and never blindly run, especially in security-critical or infra contexts.
  • Idea of agents that auto-compile/run code to filter out bad answers, but concerns they can still miss subtle or underhanded failures.
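
The “untrusted input” stance can be sketched as a gate that never runs model output in-process. This is a minimal illustration under stated assumptions (`vet_llm_code` is an invented helper, and a child process with a timeout is a stand-in for real OS-level sandboxing); as noted above, passing tests still won’t catch subtle underhanded bugs.

```python
import os, subprocess, sys, tempfile

def vet_llm_code(code: str, tests: str, timeout: float = 5.0) -> bool:
    """Treat model output as untrusted: never exec() it in-process.
    Syntax-check first, then run it with its tests in a child process
    under a timeout. (A real setup adds an OS-level sandbox on top.)"""
    try:
        compile(code, "<llm>", "exec")  # cheap syntax gate
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# A hallucinated import fails at runtime in the sandbox, not in production.
good = "def add(a, b):\n    return a + b"
bad = "import totally_made_up_module"
print(vet_llm_code(good, "assert add(2, 3) == 5"))  # True
print(vet_llm_code(bad, "pass"))                    # False
```

Note the asymmetry: the gate reliably rejects code that fails loudly (syntax errors, missing modules, failed asserts), but a `True` result only means the supplied tests passed, which is exactly the limitation the thread raises.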

Intelligence and “Smartness” Debate

  • Strong pushback on calling LLMs “smart”; described instead as sophisticated next-token predictors that sometimes align with reality.
  • Others note they already match or surpass humans on some narrow tasks, but still lack genuine abstract reasoning and autonomy.
  • Frustration with anthropomorphizing; repeated comparisons to humans seen as misleading.

Hallucinations as Design Feedback

  • Some treat hallucinated flags/APIs as feature requests: the model often “invents” interfaces that would be nicer than current ones.
  • Speculation that this could guide language and framework design, especially if future models consult project-specific “support” models instead of guessing.