LLMs aren't "trained on the internet" anymore

Scope of Training Data

  • Many argue the title's claim of “not trained on the internet” is clickbait: models still depend heavily on web data, though increasingly also on private and curated datasets.
  • The discussion highlights RLHF (reinforcement learning from human feedback), expert labeling, and synthetic data as growing components, with usage logs (e.g., ChatGPT prompts) feeding reward models and fine‑tuning.
  • Some note that high‑quality, task‑specific data and RLHF can yield larger gains than simply scaling up model size, citing small instruction‑tuned models that rival much larger baselines.
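
For readers unfamiliar with how preference data can feed a reward model, here is a minimal sketch of one common formulation, a Bradley-Terry style pairwise loss. The thread does not say how any particular lab trains its reward models, so treat this as an illustration only; the function names and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it does the opposite.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward scores."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward scores for three labeled comparisons (assumed data).
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(mean_preference_loss(example_pairs))
```

In a real pipeline the scores would come from a neural reward model and the loss would be minimized by gradient descent; the sketch only shows the quantity being optimized.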

History: Expert Systems vs LLMs

  • Several compare the new expert‑curated datasets to old expert systems.
  • One side: expert systems failed because of the “knowledge acquisition bottleneck” and their brittleness; relying on experts again risks repeating that failure.
  • Counterpoint: today’s data generation and sensing are vastly larger, and LLMs are adaptive probabilistic models, not rigid rule bases; expert data is a complement, not a return to pure expert systems.

Capabilities, Limits, and Hallucinations

  • Critics say progress on hallucination reduction and genuine “understanding” is limited; LLMs mainly excel at translation and style mimicry, not deep expertise.
  • Others argue LLMs already force a rethink of “intelligence,” and even “narrow superhuman” performance can be economically valuable.
  • There is debate over whether current methods can scale to something like AGI or whether the field is stuck in a massive game of “whack‑a‑mole,” patching failures one at a time.

Ownership, Economics, and Labor

  • Some question why platforms like Reddit and Stack Overflow licensed their data cheaply instead of building their own models; responses cite lack of capital, risk, and unclear profitability.
  • Concerns are raised about data “theft,” Creative Commons ambiguity, and the power imbalance between platforms, users, and AI labs.
  • On labeling workforces: marketing emphasizes PhDs and poets, but many expect large pools of relatively low‑paid workers doing repetitive data and annotation tasks.

Privacy, Secrecy, and “Open” Claims

  • Participants note that major labs keep exact training datasets secret for competitive and financial reasons, despite “open” branding.
  • Some see a future where proprietary, high‑quality datasets become the main differentiator, increasing incentives to keep them closed.