2026-06-19

GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Hallucination metrics and interpretation

The hallucination rates under discussion are conditional: they measure how often a model gives a wrong answer rather than abstaining on questions it cannot answer correctly, not its overall error rate in everyday use.
Example: models with identical wrong-answer counts can show very different “hallucination rates” depending on how often they abstain.
Some argue this metric is still useful because it reflects “willingness to make things up”; others prefer a global error rate (misleading tokens / total tokens).
AA-Omniscience is highlighted as a better benchmark: correct answers are rewarded, hallucinations penalized, and “I don’t know” is neutral, so always guessing is punished.
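
The difference between these metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's published formula: the hallucination rate is taken as wrong answers divided by all questions not answered correctly, and the index-style score as +1 per correct answer, -1 per wrong answer, and 0 per abstention, averaged over all questions. The function names and tallies are hypothetical, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts over a fixed question set."""
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained


def hallucination_rate(t: Tally) -> float:
    """Conditional rate (assumed definition): wrong answers among
    questions the model did not answer correctly."""
    missed = t.wrong + t.abstained
    return t.wrong / missed if missed else 0.0


def global_error_rate(t: Tally) -> float:
    """Unconditional rate: wrong answers among all questions."""
    return t.wrong / t.total if t.total else 0.0


def index_score(t: Tally) -> float:
    """Assumed index-style score: +1 correct, -1 wrong, 0 abstain."""
    return (t.correct - t.wrong) / t.total if t.total else 0.0


# Made-up tallies: both models give 20 wrong answers out of 100 questions.
cautious = Tally(correct=40, wrong=20, abstained=40)
eager = Tally(correct=75, wrong=20, abstained=5)

for name, t in [("cautious", cautious), ("eager", eager)]:
    print(name, hallucination_rate(t), global_error_rate(t), index_score(t))
```

With these made-up numbers both models have the same global error rate, yet the conditional hallucination rate differs widely because one model abstains far more often. Under the assumed index, a model that guesses on every question it does not know gains nothing from its wrong answers, while one that abstains on those questions keeps its score.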

Model size, data, and hallucinations

Thread disputes the idea that “bigger = more hallucinations” as a universal rule.
Examples: a smaller DeepSeek variant hallucinates heavily; a large proprietary model has a relatively low hallucination rate but a similar absolute number of hallucinations to a smaller one.
Some suggest large factual datasets and aggressive scaling may train models into “answering everything” instead of abstaining.
Others note diminishing returns in capability from more parameters and tokens, but see no clear evidence that size alone drives hallucinations.

Prompting, training, and system design

Several comments stress that raw model comparisons ignore prompt engineering and agent setups that encourage or reward saying “I don’t know”.
Criticism: blaming “bad prompting” is likened to blaming users for a flawed product; expectations are set by marketing that oversells reliability.
There’s speculation that current RLHF/RLVR pipelines optimize for interesting, confident answers over safe abstentions, undertraining the “don’t know” behavior.

Code generation and software quality

Strong concern that LLM-written code may look good but embed subtle errors and “anomalies” that accumulate into unmaintainable systems.
Others report good results when LLMs are used as assistants, with humans reviewing, adding tests, and applying standard engineering practices.
Several note that pre-LLM enterprise code is also often terrible; the bar may not be as high as critics assume.

Human analogies, limits, and possible fixes

Comparisons are made to human overconfidence and “bullshitting”, but also to the human ability to quickly recognize “I don’t know,” aided by consequences and fear.
Ideas floated: separate “amygdala-like” modules, secondary models to detect hallucinations, better incentives (penalize wrong answers more than abstentions).
Many doubt hallucinations can be fully eliminated with current architectures; the best hope is reducing their frequency and improving how models signal their own uncertainty.
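
The incentive idea (penalizing wrong answers more than abstentions) can be made concrete with a small expected-value sketch. It assumes a hypothetical scoring rule of +1 for a correct answer, 0 for abstaining, and a configurable penalty for a wrong answer. Under that rule, answering has positive expected score only when the model's confidence exceeds penalty / (1 + penalty). The names below are illustrative and not taken from any benchmark or training pipeline.

```python
def expected_answer_score(confidence: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 with probability `confidence`,
    -wrong_penalty otherwise. Abstaining always scores 0."""
    return confidence - (1.0 - confidence) * wrong_penalty


def answer_threshold(wrong_penalty: float) -> float:
    """Confidence above which answering beats abstaining.
    Derived from: c - (1 - c) * p > 0  <=>  c > p / (1 + p)."""
    return wrong_penalty / (1.0 + wrong_penalty)


def should_answer(confidence: float, wrong_penalty: float) -> bool:
    """True when answering has a higher expected score than abstaining."""
    return expected_answer_score(confidence, wrong_penalty) > 0.0


# Hypothetical penalties: none, equal to the reward, three times the reward.
for penalty in (0.0, 1.0, 3.0):
    print(penalty, answer_threshold(penalty), should_answer(0.6, penalty))
```

With no penalty the threshold is zero, so guessing is never worse than abstaining, which matches the behavior commenters speculate current training rewards. Raising the penalty raises the confidence a model needs before answering pays off.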
