2024-06-01

LLMs aren't "trained on the internet" anymore

Scope of Training Data

Many argue the title “not trained on the internet” is clickbait; models still heavily depend on web data, but increasingly also on private and curated datasets.
Discussion highlights RLHF, expert labeling, and synthetic data as growing components, with usage logs (e.g., ChatGPT prompts) feeding reward models and fine-tuning.
Some note that high‑quality, task‑specific data and RLHF can outperform simple model scaling, citing small instruction‑tuned models rivaling much larger baselines.

History: Expert Systems vs LLMs

Several compare the new expert‑curated datasets to old expert systems.
One side: expert systems failed due to the “knowledge acquisition bottleneck” and brittleness; relying on experts again risks repeating that failure.
Counterpoint: today’s data generation and sensing are vastly larger, and LLMs are adaptive probabilistic models, not rigid rule bases; expert data is a complement, not a return to pure expert systems.

Capabilities, Limits, and Hallucinations

Critics say progress on hallucination reduction and genuine “understanding” is limited; LLMs mainly excel at translation and style mimicry, not deep expertise.
Others argue LLMs already force a rethink of “intelligence,” and even “narrow superhuman” performance can be economically valuable.
There is debate over whether current methods can scale to something like AGI or whether we are stuck in a massive game of “whack‑a‑mole,” patching failures one at a time.

Ownership, Economics, and Labor

Some question why platforms like Reddit and Stack Overflow licensed their data cheaply instead of building their own models; responses cite lack of capital, risk, and unclear profitability.
Commenters raise concerns about data “theft,” Creative Commons ambiguity, and the power imbalance between platforms, users, and AI labs.
Discussion of labeling workforces: marketing emphasizes PhDs and poets, but many expect large pools of relatively low‑paid workers doing repetitive data and annotation tasks.

Privacy, Secrecy, and “Open” Claims

Participants note that major labs keep exact training datasets secret for competitive and financial reasons, despite “open” branding.
Some see a future where proprietary, high‑quality datasets become the main differentiator, increasing incentives to keep them closed.
