GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers
Impact on science and reproducibility
- Many see this as exacerbating an existing reproducibility and fraud crisis: LLMs make it cheaper to generate plausible but bogus work, worsening an already noisy literature.
- Some argue this might finally force the community to value replication, verification, and code/data sharing (PoC-or-GTFO) instead of novelty-first publication.
- Others counter that reproducibility alone is overrated; underlying quality, incentives, and review culture must change first.
Incentives, publish-or-perish, and peer review overload
- Commenters describe a system driven by “publish or perish” and grant chasing, where volume and h‑index dominate quality.
- Top AI conferences are swamped (tens of thousands of submissions, large growth since 2020), leading to thin, sometimes AI‑generated reviews and little checking of references.
- Reviewers say they focus on correctness and novelty, not verifying 30–50 citations per paper; fake or wrong references in introductions are rarely caught.
LLMs, fraud, and accountability
- Hallucinated references are seen as a bright‑line indicator of either LLM misuse or serious negligence; many say that once you see one, you stop trusting the rest of the paper.
- Some want severe sanctions (retractions, lifetime bans, even criminal fraud charges when public money is involved); others argue that is excessive and would require strong due process.
- There’s a distinction drawn between using LLMs for language polishing/translation versus letting them invent citations, text, or results.
Citation checking, tooling, and proposed reforms
- Multiple people ask why conferences don’t automatically validate references (DOIs, Crossref/OpenAlex, Semantic Scholar, etc.) and flag non‑resolving or obviously fake entries; a minimal sketch of such a check follows this list.
- Suggested fixes: automated “lint” for bib files; reproducibility tracks; explicit replication journals; linking papers to confirmed/failed replications; grants that fund independent reproduction.
- Some note that even pre‑AI, tools like Google Scholar produce flawed BibTeX; minor metadata errors shouldn’t be equated with full hallucinations.
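To make the “just validate the bibliography” proposal concrete, here is a minimal sketch, assuming entries have already been parsed out of a .bib file into (DOI, title) pairs. The Crossref REST endpoint (`https://api.crossref.org/works/{doi}`) is real; the parsing step, the 0.9 similarity threshold, and the sample references are illustrative, and the second DOI is deliberately fabricated to exercise the failure path. A production linter would presumably also cross‑check OpenAlex or Semantic Scholar and fuzzy‑match authors and venues.

```python
# Minimal reference-linter sketch: resolve each DOI via the public Crossref
# REST API and compare the registered title against the cited title.
import difflib
import requests

def check_reference(doi: str, cited_title: str) -> str:
    """Classify one bibliography entry as OK, UNRESOLVED, or MISMATCH."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return "UNRESOLVED"  # DOI does not exist: likely fabricated
    resp.raise_for_status()
    titles = resp.json()["message"].get("title", [])
    registered = titles[0] if titles else ""
    similarity = difflib.SequenceMatcher(
        None, cited_title.lower(), registered.lower()).ratio()
    if similarity > 0.9:          # arbitrary threshold: tolerate minor formatting drift
        return "OK"
    return f"MISMATCH (Crossref has {registered!r})"  # DOI resolves to a different work

# Hypothetical inputs: one real paper, one fabricated DOI for illustration.
refs = [
    ("10.1038/nature14539", "Deep learning"),
    ("10.9999/fake.2025.001", "A Totally Plausible Result"),
]
for doi, title in refs:
    print(doi, "->", check_reference(doi, title))
```

The limits are worth noting: a check like this only catches DOIs that fail to resolve or that point to a different work. A hallucinated citation with no DOI, or a plausible title pinned to the wrong authors, still requires metadata search and human review, which is exactly the caveat raised in the bullet above about flawed pre‑AI BibTeX.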
Skepticism about GPTZero and narrative framing
- Several see the post as a marketing piece: a “shame list” of authors used to sell GPTZero’s product, without base‑rate comparisons to pre‑LLM years.
- Concerns are raised that AI detectors are themselves error‑prone, and that false positives have already harmed students wrongly accused of using AI.
- Others respond that, ad or not, surfacing fabricated citations at a flagship conference is valuable and highlights a real structural problem.
Broader reflections
- Some argue the root issues (overcitation, status games, lack of consequences for bad work, English‑centric publishing) long predate LLMs; AI just makes the cracks visible at scale.
- There’s recurring tension between fear of “AI slop” and recognition that AI can also assist with search, translation, and tooling, provided humans remain accountable for every claim and citation.