GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2
Hallucination metrics and interpretation
- Hallucination rates discussed are conditional: they measure behavior only on questions the model fails to answer correctly (how often it gives a wrong answer rather than abstaining), not overall error in everyday use.
- Example: models with identical wrong-answer counts can show very different “hallucination rates” depending on how often they abstain.
- Some argue this metric is still useful because it reflects “willingness to make things up”; others prefer a global error rate (misleading tokens / total tokens).
- AA-Omniscience is highlighted as a better benchmark: correct answers are rewarded, hallucinations penalized, and “I don’t know” is neutral, so always guessing is punished.
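
The distinction drawn in the bullets above can be illustrated with a small sketch. All tallies are invented for illustration, and the formulas are assumptions rather than the benchmark's published definitions: the conditional rate is taken to be wrong answers divided by all questions not answered correctly, and the index-style score gives +1 for a correct answer, -1 for a wrong one, and 0 for abstaining.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts on a question set (made-up numbers)."""
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained


def conditional_hallucination_rate(t: Tally) -> float:
    # Assumed definition: of the questions NOT answered correctly,
    # the fraction that got a wrong answer instead of an abstention.
    not_correct = t.wrong + t.abstained
    return t.wrong / not_correct if not_correct else 0.0


def global_error_rate(t: Tally) -> float:
    # Wrong answers over all questions asked.
    return t.wrong / t.total


def index_style_score(t: Tally) -> float:
    # Assumed scoring: +1 correct, -1 wrong, 0 abstain, scaled to [-100, 100].
    return 100 * (t.correct - t.wrong) / t.total


# Two made-up models with the same number of wrong answers.
model_a = Tally(correct=70, wrong=20, abstained=10)
model_b = Tally(correct=40, wrong=20, abstained=40)
# A made-up model that never abstains and guesses whenever it is unsure.
always_guess = Tally(correct=45, wrong=55, abstained=0)

for name, t in [("A", model_a), ("B", model_b), ("always-guess", always_guess)]:
    print(
        name,
        round(conditional_hallucination_rate(t), 2),
        round(global_error_rate(t), 2),
        round(index_style_score(t), 1),
    )
```

In this toy data, A and B make the same number of mistakes and share the same global error rate, yet A's conditional rate is twice B's because B abstains more often. The always-guessing model ends up with a negative index-style score, which is the sense in which guessing is punished.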
Model size, data, and hallucinations
- Thread disputes the idea that “bigger = more hallucinations” as a universal rule.
- Examples: a smaller DeepSeek variant hallucinates heavily; a large proprietary model has a relatively low hallucination rate but a similar absolute number of hallucinations to a smaller one.
- Some suggest large factual datasets and aggressive scaling may train models into “answering everything” instead of abstaining.
- Others note diminishing returns in capability from more parameters and tokens, but see no clear evidence that size alone drives hallucinations.
Prompting, training, and system design
- Several comments stress that raw model comparisons ignore prompt engineering and agent setups that encourage or reward saying “I don’t know”.
- Criticism: blaming “bad prompting” is likened to blaming users for a flawed product; expectations are set by marketing that oversells reliability.
- There’s speculation that current RLHF/RLVR pipelines optimize for interesting, confident answers over safe abstentions, undertraining the “don’t know” behavior.
Code generation and software quality
- Strong concern that LLM-written code may look good but embed subtle errors and “anomalies” that accumulate into unmaintainable systems.
- Others report good results when LLMs are used as assistants, with humans reviewing, adding tests, and applying standard engineering practices.
- Several note that pre-LLM enterprise code is also often terrible; the bar may not be as high as critics assume.
Human analogies, limits, and possible fixes
- Comparisons are made to human overconfidence and “bullshitting”, but also to human ability to quickly recognize “I don’t know,” aided by consequences and fear.
- Ideas floated: separate “amygdala-like” modules, secondary models to detect hallucinations, better incentives (penalize wrong answers more than abstentions).
- Many doubt hallucinations can be fully eliminated with current architectures; the best hope is reducing their frequency and improving how models signal their own uncertainty.
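
The incentive idea mentioned above (penalizing wrong answers more than abstentions) can be made concrete with a short sketch. It assumes a scoring rule of +1 for a correct answer, a configurable penalty for a wrong one, and 0 for abstaining, and it assumes the model has access to a calibrated probability of being correct, which is a strong assumption that the thread does not claim holds. The function names are hypothetical.

```python
def expected_score_if_answering(p_correct: float, wrong_penalty: float) -> float:
    # +1 with probability p_correct, -wrong_penalty otherwise.
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    # Abstaining scores 0, so answering pays off only if the expected score is positive.
    return expected_score_if_answering(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    # Solve p - (1 - p) * c = 0 for p, giving p = c / (1 + c).
    return wrong_penalty / (1.0 + wrong_penalty)


for penalty in (0.0, 1.0, 3.0):
    print(penalty, break_even_confidence(penalty), should_answer(0.6, penalty))
```

Under these assumptions, a penalty of zero makes guessing worthwhile at any nonzero confidence, while a penalty three times the reward makes abstaining the better choice below 75% confidence.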