The real data wall is billions of years of evolution
Compute, Data, and Model Architecture
- Several argue that current progress is driven primarily by massive compute and memory bandwidth, with data already covering “things people talk about” well.
- Others stress that architectural advances (convolutions, transformers, longer context windows, better filtering) can yield big gains without more data, and that we are far from an “optimal model wall.”
- Some see “intelligence” as high sample efficiency: doing more with less data, partly via better information filtering and compression (a toy illustration follows this list).
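To make the sample-efficiency framing concrete, here is a minimal sketch assuming an invented two-class task and a nearest-centroid learner (both chosen purely for illustration, not taken from the discussion): a “sample-efficient” learner is one whose accuracy curve rises steeply at very small training-set sizes.

```python
# Hypothetical illustration of "sample efficiency": how much accuracy a learner
# reaches from only a handful of examples. The task and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def make_task(n_per_class, dim=20, sep=0.3):
    """Two synthetic Gaussian blobs, one per class; returns (X, y)."""
    a = rng.normal(+sep, 1.0, size=(n_per_class, dim))
    b = rng.normal(-sep, 1.0, size=(n_per_class, dim))
    return np.vstack([a, b]), np.array([0] * n_per_class + [1] * n_per_class)

def nearest_centroid_accuracy(n_train_per_class, n_test_per_class=500):
    """Fit a nearest-centroid classifier on few examples, test on many."""
    Xtr, ytr = make_task(n_train_per_class)
    Xte, yte = make_task(n_test_per_class)
    centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=2)
    return (dists.argmin(axis=1) == yte).mean()

# A "sample-efficient" learner climbs this curve quickly at small n.
for n in (1, 2, 5, 20, 100):
    print(f"{n:4d} examples/class -> accuracy {nearest_centroid_accuracy(n):.2f}")
```

The exact numbers are meaningless; the point is the shape of the curve, i.e. how quickly accuracy improves as examples are added.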
Evolution, DNA, and What’s Really Learned
- One camp agrees that evolution provides powerful “pre-programming,” but says this is mainly architecture/sensors/organism design, not literal stored “training data.”
- Critics say treating billions of years of evolution as something akin to GPT-style pretraining data is misleading or “Lamarckian”; evolution shapes structure and instincts, not direct experiential memories.
- Others counter that, in a broad sense, evolution itself is a learning process over genes and environments, so calling that “data” is reasonable, though details remain unclear.
Embodiment, Sensory Data, and Grounding
- Many emphasize that humans learn through long sensorimotor interaction with the real world (childhood, bodily symmetry, multi-sensory integration), giving a grounded causal intuition that text-only LLMs lack.
- Blind/deaf humans are cited both as evidence that no single modality is essential and as support for rich multimodal pretraining.
- Some suggest the true “data wall” is the massive, continuous, embodied experience accumulated from infancy onward (a rough back-of-envelope estimate follows this list).
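To give that claim a rough sense of scale, here is a back-of-envelope sketch; every constant (waking hours, sampling rate, channel count) is an assumption picked only to show order of magnitude, not a figure from the thread.

```python
# Very rough, hypothetical estimate of raw visual input accumulated by age five.
# All constants below are assumptions for illustration only.
waking_hours_per_day = 12            # assumed average waking time
days = 5 * 365                       # birth to roughly age five
seconds_awake = waking_hours_per_day * 3600 * days

visual_rate_hz = 10                  # assumed "useful" visual sampling rate
optic_nerve_fibers = 1_000_000       # roughly a million fibers per eye

visual_samples = seconds_awake * visual_rate_hz * optic_nerve_fibers
print(f"~{visual_samples:.1e} raw visual samples by age five (very rough)")
```

Even with these deliberately coarse constants, the count lands on the order of 10^14–10^15 samples, which is the intuition behind calling embodied experience the real wall.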
Language, Culture, and Social Learning
- A strong thread holds that the key differentiator is language and culture, not DNA alone: symbolic communication enables cumulative, cross-generational search and refinement.
- Society is framed as a third learning timescale beyond evolution and individual experience.
Robots, Real-World Data, and Future Directions
- When text data runs out, many expect robots and embodied agents to generate new data through real-world experiments, though physical trials are slower and failures are costlier.
- Ideas include fleets of cheap, robust robots sharing experience, evolutionary search over architectures, and multi-agent AI systems that talk to each other and perhaps develop their own “languages” (a toy evolutionary-search sketch follows this list).
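As a minimal sketch of what “evolutionary search over architectures” could look like, the toy loop below mutates and selects a tiny architecture “genome” (depth, width, context length). The fitness function is a made-up surrogate; in practice each evaluation would be a real training-and-evaluation run, which is exactly what makes such search expensive.

```python
# Toy evolutionary search over a hypothetical architecture "genome".
# The genome fields and the fitness surrogate are invented for illustration.
import random

random.seed(0)

def random_genome():
    return {"depth": random.randint(2, 24),
            "width": random.choice([128, 256, 512, 1024]),
            "context": random.choice([512, 2048, 8192])}

def mutate(genome):
    child = dict(genome)
    gene = random.choice(list(child))
    child[gene] = random_genome()[gene]         # resample a single gene
    return child

def fitness(genome):
    # Surrogate score: reward capacity, penalize a crude compute cost.
    capacity = genome["depth"] * genome["width"]
    cost = capacity * genome["context"] / 1e6
    return capacity / 1000 - 0.05 * cost

population = [random_genome() for _ in range(16)]
for _ in range(20):                             # 20 generations
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]                  # keep the top quarter
    population = survivors + [mutate(random.choice(survivors)) for _ in range(12)]

best = max(population, key=fitness)
print("best architecture found:", best, "score:", round(fitness(best), 2))
```

Swapping the surrogate for real training runs (or robot trials) is where the cost and slowness the thread worries about come in.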
AGI, Hype, and Limits of LLMs
- Some insist that LLMs do not work like brains and that over-analogizing is harmful or hype-inducing; others say biological inspiration is still useful despite limited understanding.
- There is disagreement over whether we are “eons away” from passing meaningful Turing tests or already close with focused fine-tuning.
- Several worry about overhype leading to another AI winter, urging focus on realistic, non-AGI applications.