TimeCapsuleLLM: an LLM trained only on data from 1800-1875
Idea: Time-Limited Training as AGI Test
- Many propose training a powerful model only on pre‑1900 (or similar) data and testing whether it can “rediscover” relativity, QM, or other major theories.
- If it could derive anything substantially correct from period knowledge plus experimental results, some see that as strong evidence LLMs can do more than regurgitate.
- Others argue the result would be uninformative or too easy to contaminate with post‑cutoff data.
Feasibility and Data Limitations
- Major obstacle: not enough digitized, high‑quality pre‑1900 text to reach modern frontier scales; surviving text is skewed toward elites, newspapers, and tertiary sources.
- OCR noise and metadata leaks are pervasive; avoiding post‑1900 contamination is hard.
- The absence of era-appropriate preference data for RLHF or instruction tuning is another practical blocker.
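The contamination problem above can be attacked mechanically, if imperfectly. A minimal sketch of a date-leak filter for a pre-1876 corpus; the regex, cutoff year, threshold, and sample documents are illustrative assumptions, not the project's actual pipeline:

```python
import re

# Years after 1875 appearing in a supposedly pre-1876 document are a red
# flag for post-cutoff contamination (reprints, editorial prefaces, OCR'd
# library stamps). Pattern covers 1876-2099; adjust for the chosen cutoff.
POST_CUTOFF_YEAR = re.compile(r"\b(18(?:7[6-9]|[89]\d)|19\d\d|20\d\d)\b")

def looks_contaminated(text: str, max_hits: int = 0) -> bool:
    """Flag a document that cites more than max_hits post-cutoff years."""
    return len(POST_CUTOFF_YEAR.findall(text)) > max_hits

docs = [
    "In the year 1848 the Corn Laws had lately been repealed.",
    "Reprinted 1902 by the London Historical Society.",
]
clean = [d for d in docs if not looks_contaminated(d)]
```

A real pipeline would also need term lists (e.g. post-period inventions and names), metadata checks, and manual sampling, since a year regex catches only the most obvious leaks.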
Debate: Do LLMs “Think”?
- One camp: LLMs are just token predictors, not capable of genuine reasoning or creating new paradigms; human cognition uses richer mechanisms than pattern continuation.
- Counter‑camp: even if the training objective is next‑token prediction, internal representations can still encode concepts and world models; proponents cite interpretability work as evidence of emergent “concept manipulation.”
- Some suggest language/token manipulation may be more central to human thought than assumed—but probably still not the whole story.
Einstein, Relativity, and Scientific Discovery
- Several note that by 1900 many “building blocks” of relativity and QM existed (experiments, math, partial theories).
- Disagreement centers on whether the synthesis required uniquely human abductive leaps and a willingness to reject prevailing axioms. The counter‑position: a large model could, in principle, reach similar theories by recombining the period literature with simulated experiments.
- Even if it could match Einstein once, it’s unclear whether such a system could keep pushing science forward indefinitely.
Alternative Evaluations and Benchmarks
- Suggestions include:
  - Training era‑cutoff models and testing them on future corpora as compression/perplexity benchmarks.
  - Time‑sliced software‑engineering (SWE) and science benchmarks: train on pre‑date data, evaluate on post‑date tasks.
  - Letting a pre‑cutoff model propose experiments while “nature” is simulated by humans or code.
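The perplexity‑benchmark idea can be illustrated end to end with a toy stand‑in model. This sketch uses a character‑bigram LM with add‑one smoothing in place of a real era‑cutoff LLM; the model, corpora, and smoothing are assumptions for illustration only:

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Toy character-bigram LM; stands in for an era-cutoff model."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def perplexity(counts, text, vocab_size=128):
    """Per-character perplexity of `text` under the bigram model,
    with add-one smoothing over an assumed 128-symbol vocabulary."""
    nll = 0.0
    for a, b in zip(text, text[1:]):
        ctx = counts[a]
        p = (ctx[b] + 1) / (sum(ctx.values()) + vocab_size)
        nll -= math.log(p)
    return math.exp(nll / max(len(text) - 1, 1))

# "Pre-cutoff" training text vs "post-cutoff" evaluation text.
pre = "the electric telegraph carries messages along the wire " * 20
post = "the transistor radio decodes broadcast signals " * 20
model = train_bigram(pre)
# Lower perplexity = better compression of the corpus. A real benchmark
# would compare an era-cutoff model against a modern one on the same
# post-cutoff text.
```

The same scoring loop scales up directly: replace the bigram model with any causal LM's per‑token log‑likelihoods and the toy strings with dated corpus slices.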
Historical Simulation, Bias, and Use Cases
- Many are excited about models that “speak from” a given era to expose historical mindsets, biases, and blind spots.
- Others caution that such models reflect archival survivorship bias and may overrepresent official or elite voices.
- Some see value in copyright‑clean, cutoff models as research tools and for safer experimentation.
Current TimeCapsuleLLM Quality and Engineering Notes
- Users report outputs often resemble a Markov chain: repetitive, incoherent, and not chat‑ready.
- Models are small (hundreds of millions of parameters) and lack serious post‑training or instruction tuning, limiting their usefulness beyond the proof‑of‑concept.
- Calls for better dataset release, curation, reproducible scripts, and easy chat/web demos are common.