The Einstein AI Model
Role of AI in Science vs. Mathematics and Experimentation
- Several commenters argue the post overemphasizes conjecture: in many fields, the hard part is proof, and especially physical experimentation, not just “having the idea.”
- LLMs may be huge aids for trying many “crazy” math ideas, cleaning data, literature review, code, and mundane scientific work, even if they don’t originate revolutions.
- Others stress that real science is constrained by slow, expensive experiments (biology, medicine, physics), so even a “genius” AI would hit logistical limits.
Benchmarks and Evaluating an “Einstein Model”
- Debate over whether we need benchmarks at all if true breakthroughs can just be tested in reality; counterpoint: you still need programmatic evaluation to track model progress.
- Suggested benchmarks: train on data only up to a cutoff (e.g., pre‑2023, or pre‑1905 physics) and see if the system can rediscover later results or design the same experiments.
- Practical problems noted: lack of clean post‑cutoff corpora, legal/IP issues with proprietary scientific datasets, and the difficulty of knowing whether prompts or evaluation setups are inadvertently “leading” the model toward already-known answers.
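The cutoff benchmark suggested above amounts to a date-based corpus split: train only on material published on or before the cutoff, and hold out later results as rediscovery targets. A minimal sketch, assuming a hypothetical `Document` record (the type and example entries are illustrative, not from the thread):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    title: str
    published: date
    text: str

def split_by_cutoff(corpus: list[Document], cutoff: date):
    """Split a corpus into a training set (on or before the cutoff)
    and a held-out set of later results the model must rediscover."""
    train = [d for d in corpus if d.published <= cutoff]
    holdout = [d for d in corpus if d.published > cutoff]
    return train, holdout

# Illustrative pre/post-1905 physics split.
corpus = [
    Document("Lorentz transformation groundwork", date(1904, 5, 1), "..."),
    Document("On the Electrodynamics of Moving Bodies", date(1905, 6, 30), "..."),
]
train, holdout = split_by_cutoff(corpus, date(1904, 12, 31))
```

The hard part, as the thread notes, is not the split itself but curating a clean post‑cutoff corpus and scoring whether a model’s output actually constitutes rediscovery.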
Creativity, Questioning the Status Quo, and “Move 37”
- Many push back on romanticized “challenge the status quo” narratives: questioning is easy; being right is hard.
- Some say current systems are “straight‑A students,” optimized to agree and be helpful, not to question training data or propose truly counterintuitive axioms.
- Others note that non‑LLM AI (e.g., game agents, genetic algorithms) already shows surprising, out‑of‑the‑box solutions, though often too brittle to use in practice.
Hallucinations, Honesty, and Cultural Tuning
- Long subthread on wanting models that say “I don’t know” more often versus being “overly compliant helpers.”
- Attempts to elicit blunter, “Dutch‑style” behavior mostly fail; models still hallucinate and default to American‑style politeness and phrasing.
- Discussion of RLHF and product incentives: providers may tune for engagement and apparent helpfulness rather than strict factuality or epistemic humility.
Capabilities, Hype, and Moving Goalposts
- Some complain that critiques keep shifting from “not human‑level” to “not Einstein‑level,” calling this goalpost moving; others reply that the piece targets a specific, real weakness rather than denying that current systems are AI.
- Skepticism about near-term “compressed century” claims; others emphasize ongoing exponential improvements in compute and expect much better models in a few years.
- Broad agreement that near‑term impact is “human + machine”: researchers ask good questions while AI accelerates search, synthesis, and iteration. Autonomous Einsteins are not here yet.