GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Hallucination metrics and interpretation

  • The hallucination rates under discussion are conditional: they are computed only over questions the model fails to answer correctly (it either answers wrongly or abstains), so they do not measure overall error in everyday use.
  • Example: models with identical wrong-answer counts can show very different “hallucination rates” depending on how often they abstain.
  • Some argue this metric is still useful because it reflects “willingness to make things up”; others prefer a global error rate (misleading tokens / total tokens).
  • AA-Omniscience is highlighted as a better benchmark: correct answers are rewarded, hallucinations penalized, and “I don’t know” is neutral, so always guessing is punished.
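
As a rough illustration of the points above, the sketch below computes three numbers from the same answer counts: the conditional hallucination rate, a global error rate, and a net score that rewards correct answers, penalizes wrong ones, and treats abstentions as neutral. The formulas are assumptions based on how the thread describes the metrics, not the benchmark's published definitions, and all counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical answer counts for one model on one question set."""
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained


def conditional_hallucination_rate(t: Tally) -> float:
    """Wrong answers as a share of questions NOT answered correctly.

    Assumed definition: wrong / (wrong + abstained).
    """
    not_correct = t.wrong + t.abstained
    return t.wrong / not_correct if not_correct else 0.0


def global_error_rate(t: Tally) -> float:
    """Wrong answers as a share of all questions."""
    return t.wrong / t.total if t.total else 0.0


def net_score(t: Tally) -> float:
    """Assumed scoring: +1 per correct, -1 per wrong, 0 per abstention,
    divided by the number of questions so the result lies in [-1, 1]."""
    return (t.correct - t.wrong) / t.total if t.total else 0.0


# Made-up counts, 100 questions each.
models = {
    "cautious": Tally(correct=50, wrong=20, abstained=30),
    "eager": Tally(correct=75, wrong=20, abstained=5),
    "always_guess": Tally(correct=50, wrong=50, abstained=0),
}

for name, t in models.items():
    print(
        f"{name:13s}"
        f" conditional={conditional_hallucination_rate(t):.2f}"
        f" global={global_error_rate(t):.2f}"
        f" net={net_score(t):.2f}"
    )
```

With these counts, "cautious" and "eager" give the same number of wrong answers and the same global error rate, yet their conditional rates differ by a factor of two because one abstains far more often. "always_guess" matches "cautious" on correct answers but ends with a lower net score, which is the sense in which this kind of scoring punishes guessing.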

Model size, data, and hallucinations

  • The thread disputes the idea that “bigger = more hallucinations” is a universal rule.
  • Examples: a smaller DeepSeek variant hallucinates heavily; a large proprietary model has a relatively low hallucination rate but a similar absolute number of hallucinations to a smaller model.
  • Some suggest large factual datasets and aggressive scaling may train models into “answering everything” instead of abstaining.
  • Others note diminishing returns in capability from more parameters and tokens, but see no clear evidence that size alone drives hallucinations.

Prompting, training, and system design

  • Several comments stress that raw model comparisons ignore prompt engineering and agent setups that encourage or reward saying “I don’t know”.
  • Criticism: blaming “bad prompting” is likened to blaming users for a flawed product; expectations are set by marketing that oversells reliability.
  • There’s speculation that current RLHF/RLVR pipelines optimize for interesting, confident answers over safe abstentions, undertraining the “don’t know” behavior.

Code generation and software quality

  • Strong concern that LLM-written code may look good but embed subtle errors and “anomalies” that accumulate into unmaintainable systems.
  • Others report good results when LLMs are used as assistants, with humans reviewing, adding tests, and applying standard engineering practices.
  • Several note that pre-LLM enterprise code is also often terrible; the bar may not be as high as critics assume.

Human analogies, limits, and possible fixes

  • Comparisons are made to human overconfidence and “bullshitting”, but also to human ability to quickly recognize “I don’t know,” aided by consequences and fear.
  • Ideas floated: separate “amygdala-like” modules, secondary models to detect hallucinations, better incentives (penalize wrong answers more than abstentions).
  • Many doubt hallucinations can be fully eliminated with current architectures; best hope is reducing frequency and improving self-uncertainty signaling.
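
The incentive idea above (penalize wrong answers more than abstentions) can be made concrete with a little arithmetic. The sketch below assumes a simple scoring rule that is not taken from the thread: a correct answer earns `reward`, a wrong answer costs `wrong_penalty`, and abstaining scores 0. Under that assumption, answering only pays off once the model's chance of being right exceeds `wrong_penalty / (reward + wrong_penalty)`. The function names are hypothetical.

```python
def expected_guess_score(p_correct: float, wrong_penalty: float,
                         reward: float = 1.0) -> float:
    """Expected score of answering when the answer is right with
    probability p_correct (abstaining is assumed to score 0)."""
    return p_correct * reward - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty: float, reward: float = 1.0) -> float:
    """Confidence at which answering and abstaining have equal expected score."""
    return wrong_penalty / (reward + wrong_penalty)


def should_answer(p_correct: float, wrong_penalty: float,
                  reward: float = 1.0) -> bool:
    """Answer only if the expected score beats the 0 earned by abstaining."""
    return expected_guess_score(p_correct, wrong_penalty, reward) > 0.0


# Illustrative penalties: none, symmetric, and three times the reward.
for penalty in (0.0, 1.0, 3.0):
    threshold = break_even_confidence(penalty)
    print(f"penalty={penalty:.1f} -> answer only above {threshold:.2f} confidence")
```

With no penalty the threshold is 0, so guessing is always at least as good as abstaining. Raising the penalty raises the threshold, which is the behavior the proposed incentives are meant to encourage.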