TimeCapsuleLLM: an LLM trained only on data from 1800–1875

Idea: Time-Limited Training as AGI Test

  • Many propose training a powerful model only on pre‑1900 data (or data up to a similar cutoff) and testing whether it can “rediscover” relativity, QM, or other major theories.
  • If it could derive anything substantially correct from period knowledge plus experimental results, some see that as strong evidence that LLMs can do more than regurgitate their training data.
  • Others argue the result would be uninformative or too easy to contaminate with post‑cutoff data.

Feasibility and Data Limitations

  • Major obstacle: not enough digitized, high‑quality pre‑1900 text to reach modern frontier scales; surviving text is skewed toward elites, newspapers, and tertiary sources.
  • OCR noise and metadata leaks (e.g., modern prefaces and catalogue records attached to scans) are pervasive, so avoiding post‑1900 contamination is hard; see the filtering sketch after this list.
  • The lack of era‑appropriate RLHF or instruction‑tuning data is another practical blocker.
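
One concrete way to attack the contamination problem is a two‑stage filter: drop documents whose metadata dates them after the cutoff, then reject survivors that mention post‑cutoff years or anachronistic vocabulary. A minimal Python sketch, assuming a simple `{"year", "text"}` record format and an illustrative anachronism list (both are assumptions of this sketch, not the project's actual pipeline):

```python
import re

CUTOFF_YEAR = 1875

# Illustrative list only; a real pipeline would need a much larger,
# curated lexicon plus n-gram overlap checks against modern corpora.
ANACHRONISMS = {"television", "internet", "photon", "airplane"}

# Matches four-digit years from 1000 to 2099.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def keep_document(doc: dict) -> bool:
    """Return True if `doc` passes both contamination filters.

    Expects doc = {"year": int | None, "text": str}; these field
    names are assumptions for this sketch.
    """
    # Stage 1: metadata date filter.
    year = doc.get("year")
    if year is not None and year > CUTOFF_YEAR:
        return False

    text = doc["text"].lower()

    # Stage 2a: reject texts mentioning years after the cutoff,
    # a cheap proxy for modern prefaces and catalogue records.
    if any(int(y) > CUTOFF_YEAR for y in YEAR_RE.findall(text)):
        return False

    # Stage 2b: reject texts containing anachronistic vocabulary.
    return not (set(re.findall(r"[a-z]+", text)) & ANACHRONISMS)

docs = [
    {"year": 1851, "text": "An account of the Great Exhibition."},
    {"year": None, "text": "Reprinted 1923 with a new preface."},
]
print([keep_document(d) for d in docs])  # [True, False]
```

The year regex is deliberately crude; it will not catch OCR‑mangled dates or paraphrased modern content, which is part of why commenters consider full decontamination hard.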

Debate: Do LLMs “Think”?

  • One camp: LLMs are just token predictors, not capable of genuine reasoning or creating new paradigms; human cognition uses richer mechanisms than pattern continuation.
  • Counter‑camp: even if the training objective is next‑token prediction, internal representations can encode concepts and world models; proponents argue for emergent “concept manipulation” and cite interpretability work in support.
  • Some suggest language/token manipulation may be more central to human thought than commonly assumed, though probably still not the whole story.

Einstein, Relativity, and Scientific Discovery

  • Several note that by 1900 many “building blocks” of relativity and QM already existed: experiments (e.g., Michelson–Morley), mathematics (e.g., non‑Euclidean geometry), and partial theories (e.g., Lorentz's electrodynamics).
  • Disagreement centers on whether synthesis required uniquely human “abductive leaps” and willingness to reject prevailing axioms, or whether a large model could, in principle, find similar theories by recombining literature and simulated experiments.
  • Even if it could match Einstein once, it’s unclear whether such a system could keep pushing science forward indefinitely.

Alternative Evaluations and Benchmarks

  • Suggestions include:
    • Training era‑cutoff models and testing them on post‑cutoff corpora as compression/perplexity benchmarks (see the sketch after this list).
    • Time‑sliced software‑engineering and science benchmarks: train only on material from before a chosen date, then evaluate on tasks written after it.
    • Letting a pre‑cutoff model propose experiments while “nature” is simulated by humans or code.
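
The first suggestion reduces to a measurable quantity: how well an era‑cutoff model compresses text written after its cutoff. Below is a minimal perplexity probe using the Hugging Face transformers API; the model identifier is a placeholder, not a published TimeCapsuleLLM checkpoint:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute an actual era-cutoff checkpoint.
MODEL_ID = "path/to/era-cutoff-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def perplexity(text: str, max_len: int = 512) -> float:
    """Per-token perplexity of `text` under the model.

    Lower values mean the model predicts (compresses) the text better.
    """
    enc = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_len)
    # With labels=input_ids, the model returns the mean cross-entropy
    # over next-token predictions.
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Compare period-plausible text against clearly post-cutoff text.
print(perplexity("The steam engine has transformed the cotton mills."))
print(perplexity("The transistor replaced the vacuum tube in computers."))
```

Tracking this gap across evaluation corpora from successive decades would show how quickly a cutoff model's predictive grip decays as the world moves past its training data.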

Historical Simulation, Bias, and Use Cases

  • Many are excited about models that “speak from” a given era to expose historical mindsets, biases, and blind spots.
  • Others caution that such models reflect archival survivorship bias and may overrepresent official or elite voices.
  • Some see value in copyright‑clean, era‑cutoff models as research tools and for safer experimentation.

Current TimeCapsuleLLM Quality and Engineering Notes

  • Users report that outputs often resemble Markov‑chain text: repetitive, incoherent, and not chat‑ready (a simple way to quantify this appears after this list).
  • Models are small (hundreds of millions of parameters) and lack serious post‑training or instruction tuning, limiting their usefulness beyond the proof‑of‑concept.
  • Calls for better dataset release, curation, reproducible scripts, and easy chat/web demos are common.
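
The “Markov chain” complaint can be made measurable with a distinct‑n‑gram ratio over sampled outputs: values near 1.0 indicate varied text, values near 0.0 indicate degenerate repetition. The metric choice here is our illustration, not part of the project:

```python
def distinct_n(text: str, n: int = 3) -> float:
    """Fraction of unique n-grams among all n-grams in `text`."""
    tokens = text.split()
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

# Degenerate, loop-like output scores low; varied output scores high.
print(distinct_n("the king the king the king the king"))       # ~0.33
print(distinct_n("the king rode out to meet the parliament"))  # 1.0
```

Reporting a score like this alongside sample generations would make quality changes between checkpoints easier to track than eyeballing outputs.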