Making o1, o3, and Sonnet 3.7 hallucinate for everyone
Hallucination vs. Regurgitation
- Debate over whether the example is a “hallucination” or just reproducing a wrong pattern from training data.
- One view: if the incorrect syntax appeared online, the model is regurgitating contaminated or adversarial data, not inventing facts.
- Counterview: even with perfectly correct training data, next-token models will still produce incorrect combinations (like a Markov chain mixing “roses” with “blue and deep”), so hallucinations are inherent.
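The Markov-chain analogy can be made concrete with a toy bigram model (an illustration, not code from the thread): even when every training sentence is correct, recombining correct local transitions can yield a globally false statement.

```python
from collections import defaultdict

# Every "fact" in this tiny corpus is correct.
corpus = "roses are red violets are blue".split()

# Build a bigram table: each word maps to the set of words that followed it.
follows = defaultdict(set)
for a, b in zip(corpus, corpus[1:]):
    follows[a].add(b)

# Enumerate every two-step continuation starting from "roses".
phrases = {f"roses {w1} {w2}" for w1 in follows["roses"] for w2 in follows[w1]}

print(sorted(phrases))
# → ['roses are blue', 'roses are red']
```

"roses are blue" never appears in the corpus, yet the model generates it: each local transition is faithful to the data, but the combination is a confident non-fact. This is the structural argument for hallucination being inherent to next-token prediction, not just a data-contamination problem.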
Nature and Inevitability of Hallucinations
- Some argue everything an LLM outputs is a kind of hallucination; we only call it that when it’s wrong.
- Others maintain “hallucination” is meaningful: it’s when the model confidently presents non-facts (invented APIs, syntax, papers) as true.
- Clarification: in this case, a post in the training data merely *proposed* a syntax/interface; the model incorrectly treats that proposal as established reality.
Training Data, Contamination, and Data Poisoning
- Concern that anyone can publish bogus syntax or backdoored patterns that get scraped and later reproduced by models.
- Suggestion that adversaries could flood the web with plausible-looking but wrong tutorials, especially auto-generated by AI.
- Question of why niche, wrong patterns sometimes dominate over vast amounts of correct code; proposed reasons include prompt similarity (“How do I…”) and niche topics being underweighted in training.
LLMs for Coding: Usefulness and Pitfalls
- Many report LLMs inventing language features, CLI flags, library APIs, and OpenAPI rules, especially for niche or less-documented topics.
- Users see them as helpful for boilerplate, scaffolding, refactors, and mainstream tech stacks, but dangerous as a primary research tool.
- Perceived answer quality often inversely correlates with user expertise; novices are more easily impressed and misled.
Prompting, Context, and Tooling
- Strong emphasis on detailed system prompts, specifying language versions, and feeding compiler/LSP errors back to the model.
- Some find more context improves results; others see more hallucinations or distraction beyond a “sweet spot.”
- Tools that integrate codebase search, static analysis, and iterative compilation are seen as more promising than raw chat.
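The compile-feedback loop commenters describe can be sketched minimally as below. The `fix_missing_colon` stub stands in for a real model call; in practice the broken source and the compiler error would be sent back to the LLM each round.

```python
def check(source: str):
    """Return the compiler error for `source`, or None if it compiles."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as e:
        return f"{e.msg} (line {e.lineno})"

def repair_loop(source: str, ask_model, max_rounds: int = 3) -> str:
    """Feed each compile error back to the model until the code compiles."""
    for _ in range(max_rounds):
        error = check(source)
        if error is None:
            return source
        source = ask_model(source, error)  # model sees its own error
    return source

# Stub standing in for an LLM call (hypothetical, for illustration only).
def fix_missing_colon(source: str, error: str) -> str:
    return source.replace("def f(x)\n", "def f(x):\n")

broken = "def f(x)\n    return x + 1\n"
repaired = repair_loop(broken, fix_missing_colon)
assert check(broken) is not None   # original fails to compile
assert check(repaired) is None     # loop converges to valid code
```

The design point the thread makes: wiring compiler/LSP output into the loop turns a one-shot guess into an iterative search with a ground-truth signal, which is why integrated tools are seen as more promising than raw chat.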
Testing, Safety, and Reliability
- Argument that hallucinations in code are “least dangerous” because compilation/tests expose them—countered by examples of subtle, test-passing bugs.
- Consensus that LLM output should be treated as untrusted input: validate, test, and never blindly run, especially in security-critical or infra contexts.
- Idea of agents that auto-compile/run code to filter out bad answers, but concerns they can still miss subtle or underhanded failures.
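A test-based filter for untrusted model output might look like the sketch below (names are illustrative). It also demonstrates the counterargument about subtle, test-passing bugs: a weak test suite lets a wrong implementation through. Note that `exec` alone does not contain malicious code; a real harness would also sandbox the process.

```python
def passes_tests(candidate_src: str, tests) -> bool:
    """Treat generated code as untrusted: run it in its own namespace
    and accept it only if every test passes."""
    ns = {}
    try:
        exec(candidate_src, ns)
        for t in tests:
            t(ns)
    except Exception:
        return False
    return True

good  = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a * b\n"  # note: add(2, 2) == 4 still holds!

def t1(ns): assert ns["add"](2, 2) == 4
def t2(ns): assert ns["add"](2, 3) == 5

assert passes_tests(good,  [t1, t2]) is True
assert passes_tests(buggy, [t1]) is True       # subtle bug slips past a weak suite
assert passes_tests(buggy, [t1, t2]) is False  # a second case catches it
```

This is exactly the tension in the thread: auto-running code filters out many bad answers, but the filter is only as strong as the test cases, so underhanded or coincidentally-passing failures can survive.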
Intelligence and “Smartness” Debate
- Strong pushback on calling LLMs “smart”; described instead as sophisticated next-token predictors that sometimes align with reality.
- Others note they already match or surpass humans on some narrow tasks, but still lack genuine abstract reasoning and autonomy.
- Frustration with anthropomorphizing; repeated comparisons to humans seen as misleading.
Hallucinations as Design Feedback
- Some treat hallucinated flags/APIs as feature requests: the model often “invents” interfaces that would be nicer than current ones.
- Speculation that this could guide language and framework design, especially if future models consult project-specific “support” models instead of guessing.