Deep learning gets the glory, deep fact checking gets ignored
AI for reproducing vs generating research
- Many argue AI should first reliably reproduce existing research (implementing methods from papers or re-deriving classic results) before being trusted to generate new science.
- Some see value in having models finish partially written papers or reconstruct raw data from published statistical summaries, but stress that this still requires human auditing and strict dataset controls to avoid leakage.
- Others suggest benchmarks that restrict training data to pre‑discovery knowledge and ask whether an AI can rediscover seminal results (e.g., classic physics experiments); a minimal sketch of such a cutoff filter follows this list.
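As a rough illustration (not something specified in the discussion), a pre‑discovery benchmark could start with a simple temporal filter over the candidate training corpus. The `Document` fields and the cutoff date below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical document record; field names are illustrative, not from any real benchmark.
@dataclass
class Document:
    doc_id: str
    published: date
    text: str

def pre_discovery_corpus(docs, discovery_date):
    """Keep only documents published strictly before the target discovery.

    The idea: train (or fine-tune) a model on this filtered corpus, then ask it
    to rediscover the result and compare against the historical record.
    """
    return [d for d in docs if d.published < discovery_date]

# Example: restrict a toy corpus to material dated before a chosen cutoff.
corpus = [
    Document("a", date(1995, 3, 1), "..."),
    Document("b", date(2001, 7, 9), "..."),
]
train_set = pre_discovery_corpus(corpus, date(1998, 1, 1))
print([d.doc_id for d in train_set])  # -> ['a']
```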
Verification, reproducibility, and incentives
- Reproducibility work is common but usually invisible: researchers often privately re‑implement a published result (“X with Y”) as a stepping stone to their own paper (“Z with Y”), and failed replications are rarely published.
- Incentives in academia and industry favor novelty and citations over robustness, discouraging release of code, data, and careful error-checking.
- Sensational but wrong papers often get more attention than sober refutations; rebuttals and replication papers are hard to publish and under‑cited.
Biology and domain-specific challenges
- In biology, validating model predictions (e.g., protein function) can take years of purification, expression, localization, knockout, and binding studies, often yielding ambiguous or contradictory results.
- Because experimental validation is so costly, few will spend years testing “random model predictions,” making flashy but wrong ML biology papers hard to dislodge.
Limits of deep learning and LLMs
- Several commenters emphasize that models trained on lossy encodings of complex domains (like biology) will inevitably produce confident nonsense; in NLP, humans can cheaply spot errors, but not in wet-lab science.
- Transformers often achieve impressive test metrics yet fail in real-world deployment, suggesting overfitting to dataset quirks or leakage. Extremely high reported accuracies are treated as a red flag.
- Data contamination is seen as pervasive and hard to rule out at web scale; some argue we should assume leakage unless strongly proven otherwise (a simple overlap check is sketched below).
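One common, if coarse, way to probe for verbatim leakage is n-gram overlap between a test item and the training corpus. The sketch below is a minimal illustration of that idea; the function names are made up, and a low score does not rule out contamination from paraphrases, translations, or derived data.

```python
def ngrams(text, n=8):
    """Lowercased whitespace-token n-grams; a coarse unit for overlap checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_item, training_texts, n=8):
    """Fraction of the test item's n-grams that appear verbatim in training data.

    A high score suggests possible leakage; a low score does NOT prove its absence.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    hits = sum(1 for g in test_grams if g in train_grams)
    return hits / len(test_grams)

# Illustrative usage with toy strings.
score = contamination_score(
    "the quick brown fox jumps over the lazy dog near the river bank today",
    ["a corpus line containing the quick brown fox jumps over the lazy dog near the river"],
)
print(f"{score:.2f}")
```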
Hype vs grounded utility
- LLMs are likened to “stochastic parrots” and “talking sand”: astonishingly capable at language and coding assistance, but fundamentally unreliable without external checks.
- They can excel at brainstorming, literature reviews, and code generation, working like junior assistants when paired with linters, tests, and human review (see the sketch at the end of this section), but they are unsuited to act as unsupervised “AI scientists.”
- Many see the core challenge ahead as building systems and institutions that reward deep fact-checking and verification at least as much as eye-catching model demos.
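As a hedged sketch of the “paired with linters, tests, and human review” workflow mentioned above: model-generated code can be gated behind automated checks before a human ever looks at it. The tool choices (`ruff`, `pytest`) and function names below are assumptions, not anything prescribed in the discussion.

```python
import subprocess
import tempfile
from pathlib import Path

def gate_generated_code(code: str, test_code: str) -> bool:
    """Run a linter and a test suite over model-generated code before review.

    Returns True only if both checks pass; a human still reviews anything that
    gets through. Assumes ruff and pytest are installed on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.py"
        src.write_text(code)
        tests = Path(tmp) / "test_candidate.py"
        tests.write_text(test_code)

        # Reject code that fails static checks outright.
        lint = subprocess.run(["ruff", "check", str(src)], capture_output=True)
        if lint.returncode != 0:
            return False

        # Only accept the suggestion if the accompanying tests pass.
        result = subprocess.run(["pytest", "-q", str(tests)], cwd=tmp, capture_output=True)
        return result.returncode == 0
```

The point of the gate is not that passing tests makes the output trustworthy, but that failing them cheaply filters out a large share of confident nonsense before any human time is spent.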