Ilya Sutskever NeurIPS talk [video]
Peak data & limits of current scaling
- Multiple commenters focus on the claim that “pre‑training as we know it will end” because we’ve hit “peak data.”
- Some see this as an important public acknowledgment that increasing model size and internet-scale data no longer guarantee easy gains.
- Others argue we haven’t exhausted what can be learned from existing data; current methods are inefficient at extracting knowledge.
Future training data sources
- Suggestions include proprietary corpora (e.g., news, books, pharma, energy, internal codebases) where owners can sidestep copyright issues.
- Ideas for new data generation: robots in the real world, continuous learning from users, self‑driving logs, surveillance video, XR/smart glasses, personal telemetry (keylogging, screenshots, etc.).
- Some propose large-scale book scanning or reviving old digitization projects.
- Concern that many remaining rich datasets are locked in commercial silos and will stay closed.
Synthetic data: usefulness debated
- One camp claims synthetic datasets are mostly useless beyond narrow cases; better to re‑use real data.
- Others counter that major labs report strong gains from synthetic data and that the question is unsettled.
- It’s noted that Sutskever himself sounded skeptical about synthetic data in the talk, though some commenters say he may be wrong.
Domain‑specific models and expert work
- Lively thread on “state law LLMs” and narrow experts:
  - Supporters think curated, smaller domains (law, proprietary code, niche languages) can yield near‑expert models and commoditize expertise, reducing demand for specialists.
  - Critics argue law in particular depends on real‑world context, messy incentives, and high stakes; LLM‑grade answers are risky when errors are costly.
- Parallel drawn to code: LLMs already help non‑experts, but their outputs still need human review.
Reasoning, agents, and unpredictability
- Discussion on “agentic intelligence” as models that set goals, plan, and act autonomously, versus today’s answer‑only systems.
- Some agree with the claim that “more reasoning is more unpredictable,” linking useful reasoning to non‑obvious, hard‑to‑anticipate outputs.
- Others push back, saying reasoning is in principle deterministic; unpredictability is about our limited ability to follow it.
Self‑awareness and meaning
- Extended debate on whether current models are “self‑aware” in any meaningful sense:
  - One side points to models’ ability to talk about themselves and adapt behavior as a minimal form of self‑awareness.
  - The other insists this is just pattern completion from instruction tuning, with no genuine intent or inner experience.
- Philosophical arguments invoke the Chinese room, “theory of mind,” and whether meaning exists without observers.
Biology analogies & brain/body scaling
- Several criticize references to “neurons” and brain–body mass ratios:
  - Biological neurons are biophysically very different from transformer units.
  - Brain/body ratio is a noisy correlate of intelligence; examples like birds or ants complicate simple scaling stories.
- Others defend loose analogy as historically useful inspiration, even if not biologically faithful.
Talk quality, context, and hype
- Many find the talk underwhelming or “fluffy,” saying it offered little new to people following the field and leaned on a grand, speculative tone.
- Clarification: this was a NeurIPS “Test of Time” award talk about a 2014 paper, partly retrospective rather than a new technical result.
- Some note a pattern of overly optimistic timelines from prominent figures, attributing this partly to fundraising incentives.
- Broader concern that NeurIPS and AI discourse are increasingly dominated by “bros,” grifters, and hype, overshadowing careful research.
Ethics, environment, and resource analogies
- The “internet as oil” metaphor is read by some as an admission of extractive business models.
- Environmental worries surface around compute and data center water use (“boiling lakes”).
- A few raise the prospect that early powerful AIs will effectively be slaves and warn about delayed recognition of their moral status.