2024-07-21

The data that powers AI is disappearing fast

Consent, Terms of Service, and Expectations

Many argue there was never real consent for AI training: people who uploaded to platforms (YouTube, Reddit, etc.) did not knowingly agree to having their faces, voices, and styles used to train powerful generative models.
Others counter that users “consented” via ToS and third‑party doctrine: posting publicly means no expectation of privacy and platforms can pass data on.
A substantial subthread stresses informed consent: people in 2010–2015 could not realistically foresee deepfakes or style/voice cloning, so broad “future uses” clauses feel illegitimate.
There’s disagreement over whether this moral critique will carry legal weight: some think courts will uphold ToS; others point to the possibility that such contract terms are invalidated and to the changed context since users agreed.

Copyright, Fair Use, and What Training “Is”

One camp insists training is non‑infringing “doing math”: models store parameters, not works; reproduction is rare and often guarded against.
The other camp treats training as large‑scale copying and derivative‑work creation, clearly within copyright’s scope, especially when verbatim or near‑verbatim output is demonstrated.
There’s debate over whether model weights themselves are a “copy” or whether infringement only happens at output time.
Several point to existing doctrines: the “substantial similarity” test, the fair‑use factors, and the principle that copyright protects expression but not facts.
Legal status is described as unsettled; some point to Japan’s explicit carve‑out for machine learning, and expect divergent national rules.

Data Access, Blocking, and Centralization

More sites are using robots.txt or paywalls to block AI crawlers, partly over IP and ethical concerns and partly because some bots behave abusively (imposing high load, ignoring robots.txt).
Critics say calling this a “decline in consent” is misleading; it’s a new assertion of rights, not a withdrawal.
Concern: incumbents that already scraped “everything” now sit on privileged corpora, while later entrants and researchers face locked‑down data and expensive licenses (Reddit, Twitter, Getty, Elsevier, etc.).
Others argue much blocked data is low‑value; blocking may simply cause it to disappear over time, while high‑value holders will sell access.
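
As a rough illustration of the robots.txt blocking discussed above, the sketch below checks whether a crawler is permitted to fetch a URL, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for the example; GPTBot and CCBot are used only as sample user-agent names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (illustrative rules only).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "SomeBrowser"):
        print(agent, is_allowed(agent, "https://example.com/articles/1"))
```

The check only works for crawlers that choose to run it: robots.txt is advisory, which is why the complaint about bots ignoring it matters.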

Creators, Compensation, and Public Backlash

Many posters focus on creators being “screwed”: work, likeness, and personal data are used without consent or payment, while AI products are monetized.
Counter‑arguments: most “ordinary” people don’t earn from IP and gain more from cheap tools; creators were already exploited by distributors.
Several warn that “move fast and break things” scraping is destroying public support for AI and will invite harsher regulation, especially in places like the EU.

Synthetic Data and Future Directions

Some expect synthetic or self‑generated data to become central, reducing dependence on web scraping; others invoke “garbage in, garbage out” and limits from the data processing inequality.
Examples raised: self‑play (AlphaZero), rule‑based synthetic data, and training on structured, cleaner corpora (e.g., Wikipedia, textbooks) rather than the whole web.
A minority suggests LLM‑style web‑scale training is a dead end, predicting a shift toward models learning from raw environmental streams (audio/video/robotics) instead.
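
The data processing inequality invoked above can be stated compactly. The labels attached to X, Y, and Z below are an illustrative reading, not something spelled out in the thread.

```latex
% Assumed setup: X -> Y -> Z is a Markov chain, i.e. Z depends on X only through Y.
% Illustrative labels: X = real-world source data, Y = model trained on X,
% Z = synthetic data sampled from Y alone.
X \to Y \to Z \quad\Longrightarrow\quad I(X;Z) \le I(X;Y)
```

Under that reading, data generated purely from a model cannot carry more information about the original source than the model already holds. The bound stops applying once generation draws on an outside signal, such as the game rules in self‑play, because Z then no longer depends on X only through Y.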
