The Einstein AI Model
Role of AI in Science vs. Mathematics and Experimentation
- Several commenters argue the post overemphasizes conjecture: in many fields, the hard part is proof, and especially physical experimentation, not just “having the idea.”
- LLMs may be huge aids for trying many “crazy” math ideas, cleaning data, literature review, code, and mundane scientific work, even if they don’t originate revolutions.
- Others stress that real science is constrained by slow, expensive experiments (biology, medicine, physics), so even a “genius” AI would hit logistical limits.
Benchmarks and Evaluating an “Einstein Model”
- Debate over whether we need benchmarks at all if true breakthroughs can just be tested in reality; counterpoint: you still need programmatic evaluation to track model progress.
- Suggested benchmarks: train on data only up to a cutoff (e.g., pre‑2023, or pre‑1905 physics) and see if the system can rediscover later results or design the same experiments.
- Practical problems noted: lack of clean post‑cutoff corpora, legal/IP issues with proprietary scientific datasets, and the difficulty of knowing whether prompts or evaluation setups are inadvertently “leading” the model toward already-known answers.
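The cutoff benchmark suggested above amounts to a date-based corpus split: train only on material published on or before the cutoff, and hold out later results as rediscovery targets. A minimal sketch, assuming a hypothetical `Document` record (the type and example entries are illustrative, not from the thread):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    title: str
    published: date
    text: str

def split_by_cutoff(corpus: list[Document], cutoff: date):
    """Split a corpus into a training set (on or before the cutoff)
    and a held-out set of later results the model must rediscover."""
    train = [d for d in corpus if d.published <= cutoff]
    holdout = [d for d in corpus if d.published > cutoff]
    return train, holdout

# Illustrative pre/post-1905 physics split.
corpus = [
    Document("Lorentz transformation groundwork", date(1904, 5, 1), "..."),
    Document("On the Electrodynamics of Moving Bodies", date(1905, 6, 30), "..."),
]
train, holdout = split_by_cutoff(corpus, date(1904, 12, 31))
```

The hard part, as the thread notes, is not the split itself but curating a clean post‑cutoff corpus and scoring whether a model’s output actually constitutes rediscovery.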
Creativity, Questioning the Status Quo, and “Move 37”
- Many push back on romanticized “challenge the status quo” narratives: questioning is easy; being right is hard.
- Some say current systems are “straight‑A students,” optimized to agree and be helpful, not to question training data or propose truly counterintuitive axioms.
- Others note that non‑LLM AI (e.g., game agents, genetic algorithms) already shows surprising, out‑of‑the‑box solutions, though often too brittle to use in practice.
Hallucinations, Honesty, and Cultural Tuning
- Long subthread on wanting models that say “I don’t know” more often versus being “overly compliant helpers.”
- Attempts to elicit blunter, “Dutch‑style” behavior mostly fail; models still hallucinate and default to American‑style politeness and phrasing.
- Discussion of RLHF and product incentives: providers may tune for engagement and apparent helpfulness rather than strict factuality or epistemic humility.
Capabilities, Hype, and Moving Goalposts
- Some complain that critiques keep shifting from “not human‑level” to “not Einstein‑level,” calling this goalpost moving; others reply that the piece targets a specific, real weakness rather than denying that current systems are AI.
- Skepticism about near-term “compressed century” claims; others emphasize ongoing exponential improvements in compute and expect much better models in a few years.
- Broad agreement that near‑term impact is “human + machine”: researchers ask good questions while AI accelerates search, synthesis, and iteration. Autonomous Einsteins are not here yet.