Making o1, o3, and Sonnet 3.7 hallucinate for everyone

Hallucination vs. Regurgitation

  • Debate over whether the example is a “hallucination” or just reproducing a wrong pattern from training data.
  • One view: if the incorrect syntax appeared online, the model is regurgitating contaminated or adversarial data, not inventing facts.
  • Counterview: even with perfectly correct training data, next-token models will still produce incorrect combinations (the way a Markov chain trained on “roses are red” and “violets are blue and deep” can emit “roses are blue and deep”), so hallucinations are inherent.
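
The counterview can be made concrete with a toy bigram model. This is only a sketch of the recombination mechanism, not a claim about how LLMs actually work internally: trained on two individually correct sentences, the chain can still emit a blend that appeared in neither.

```python
# Each training sentence is individually correct.
sentences = ["roses are red", "violets are blue and deep"]

# Bigram table: {"roses": ["are"], "are": ["red", "blue"], ...}
chain = {}
for s in sentences:
    words = s.split()
    for prev, nxt in zip(words, words[1:]):
        chain.setdefault(prev, []).append(nxt)

def generate(word):
    """Yield every word sequence the chain can produce from `word`."""
    if word not in chain:
        yield [word]
        return
    for nxt in chain[word]:
        for rest in generate(nxt):
            yield [word, *rest]

outputs = [" ".join(seq) for seq in generate("roses")]
print(outputs)  # ['roses are red', 'roses are blue and deep']
```

The second output never occurred in the data; it falls out of stitching together locally valid transitions, which is the sense in which “hallucination” survives even perfect training data.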

Nature and Inevitability of Hallucinations

  • Some argue everything an LLM outputs is a kind of hallucination; we only call it that when it’s wrong.
  • Others maintain “hallucination” is meaningful: it’s when the model confidently presents non-facts (invented APIs, syntax, papers) as true.
  • Clarification: in the case at hand, a blog post in the training data merely proposed a syntax/interface; the model treats that proposal as established reality.

Training Data, Contamination, and Data Poisoning

  • Concern that anyone can publish bogus syntax or backdoored patterns that get scraped and later reproduced by models.
  • Suggestion that adversaries could flood the web with plausible-looking but wrong tutorials, especially auto-generated by AI.
  • Question of why niche, wrong patterns sometimes dominate over vast amounts of correct code; proposed reasons include prompt similarity (“How do I…”) and niche topics being underweighted in training.

LLMs for Coding: Usefulness and Pitfalls

  • Many report LLMs inventing language features, CLI flags, library APIs, and OpenAPI rules, especially for niche or less-documented topics.
  • Users see them as helpful for boilerplate, scaffolding, refactors, and mainstream tech stacks, but dangerous as a primary research tool.
  • Perceived answer quality often inversely correlates with user expertise; novices are more easily impressed and misled.

Prompting, Context, and Tooling

  • Strong emphasis on detailed system prompts, specifying language versions, and feeding compiler/LSP errors back to the model.
  • Some find more context improves results; others see more hallucinations or distraction beyond a “sweet spot.”
  • Tools that integrate codebase search, static analysis, and iterative compilation are seen as more promising than raw chat.

Testing, Safety, and Reliability

  • Argument that hallucinations in code are “least dangerous” because compilation/tests expose them—countered by examples of subtle, test-passing bugs.
  • Consensus that LLM output should be treated as untrusted input: validate, test, and never blindly run, especially in security-critical or infra contexts.
  • Idea of agents that auto-compile/run code to filter out bad answers, but concerns they can still miss subtle or underhanded failures.
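
The “untrusted input” stance can be sketched as a gate that never runs model output in-process. This is a minimal illustration under stated assumptions (`vet_llm_code` is an invented helper, and a child process with a timeout is a stand-in for real OS-level sandboxing); as noted above, passing tests still won’t catch subtle underhanded bugs.

```python
import os, subprocess, sys, tempfile

def vet_llm_code(code: str, tests: str, timeout: float = 5.0) -> bool:
    """Treat model output as untrusted: never exec() it in-process.
    Syntax-check first, then run it with its tests in a child process
    under a timeout. (A real setup adds an OS-level sandbox on top.)"""
    try:
        compile(code, "<llm>", "exec")  # cheap syntax gate
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# A hallucinated import fails at runtime in the sandbox, not in production.
good = "def add(a, b):\n    return a + b"
bad = "import totally_made_up_module"
print(vet_llm_code(good, "assert add(2, 3) == 5"))  # True
print(vet_llm_code(bad, "pass"))                    # False
```

Note the asymmetry: the gate reliably rejects code that fails loudly (syntax errors, missing modules, failed asserts), but a `True` result only means the supplied tests passed, which is exactly the limitation the thread raises.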

Intelligence and “Smartness” Debate

  • Strong pushback on calling LLMs “smart”; described instead as sophisticated next-token predictors that sometimes align with reality.
  • Others note they already match or surpass humans on some narrow tasks, but still lack genuine abstract reasoning and autonomy.
  • Frustration with anthropomorphizing; repeated comparisons to humans seen as misleading.

Hallucinations as Design Feedback

  • Some treat hallucinated flags/APIs as feature requests: the model often “invents” interfaces that would be nicer than current ones.
  • Speculation that this could guide language and framework design, especially if future models consult project-specific “support” models instead of guessing.