A major AI training dataset contains millions of examples of personal data
Legal status of LLM training under GDPR and similar laws
- Several commenters argue that no current large LLM provider is truly GDPR-compliant, mainly because:
  - Explicit, purpose-specific consent for training is rarely obtained.
  - GDPR requires the ability to revoke consent and request erasure, which clashes with the lack of effective “machine unlearning,” especially for open-weight models (see the sketch after this list).
- Others note GDPR has “reasonableness / feasibility / state of the art” clauses that may temper strict obligations for LLMs versus, say, social networks.
- Mistral is mentioned as EU-based but seen as opaque about its training data; it is unclear whether it is genuinely compliant.
- Some see GDPR, DSA, AI Act etc. as anti-growth and fear they will drive AI development to China; others counter that tech companies simply haven’t bothered to invest in compliance or ethics.
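One way to make the erasure clash concrete: deleting a person’s records from a training corpus is a simple filter, while deleting their influence from trained weights has no comparable operation. A minimal Python sketch, assuming a hypothetical `subject_id` provenance field (no provider’s actual pipeline is implied):

```python
# Minimal sketch of the asymmetry described above. The "subject_id"
# provenance field is a hypothetical assumption for illustration.

def erase_subject(records: list[dict], subject_id: str) -> list[dict]:
    """Dataset-level erasure: drop every record tied to one data subject."""
    return [r for r in records if r.get("subject_id") != subject_id]

# There is no weight-level equivalent: once a record has influenced
# training, its contribution is diffused across all parameters. Honoring
# erasure there means retraining or approximate "machine unlearning,"
# and already-distributed copies of an open-weight model cannot be recalled.
```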
Enforcement, jurisdiction, and corporate behavior
- Discussion of fines of up to 4% of global annual revenue and data protection authorities’ ability to act without individual lawsuits (see the worked example after this list).
- Historical examples (Clearview, Stability, Meta, Uber, Airbnb) fuel skepticism that enforcement will be strong enough to change behavior; firms may treat fines as a cost of doing business or avoid jurisdictions.
- Concern that if every EU company hosting open-weight models is treated as a data controller, it could chill AI use in the EU.
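For scale, the fine ceiling under discussion is GDPR Art. 83(5): the greater of EUR 20 million or 4% of total worldwide annual turnover. A trivial worked example in Python (the revenue figures are illustrative, not any real company’s):

```python
# GDPR Art. 83(5) caps fines for the most serious infringements at the
# greater of EUR 20 million or 4% of worldwide annual turnover.

def gdpr_max_fine(annual_turnover_eur: float) -> float:
    """Return the statutory maximum fine for a given annual turnover."""
    return max(20_000_000.0, 0.04 * annual_turnover_eur)

print(gdpr_max_fine(50_000_000_000))  # EUR 2 billion cap for a EUR 50B firm
print(gdpr_max_fine(100_000_000))     # EUR 20 million floor applies
```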
Public data, consent, and “victim blaming”
- One side: anything posted publicly (LinkedIn, blogs, image hosts) is effectively fair game; people should know by now the internet is not private.
- Counterpoint: this shifts blame from corporations to individuals; much of the data was:
  - Uploaded by non-technical users who didn’t foresee AI training.
  - Exposed via misconfigurations or platform decisions.
  - Posted about you by others (schools, relatives, companies).
- Debate over whether it’s reasonable to expect people to anticipate genAI use decades later.
Harms, “ID theft,” and accountability
- Strong frustration that data misuse and breaches rarely bring serious consequences.
- Some call for criminal liability for executives, asset seizure, or extremely harsh penalties; others implicitly question feasibility.
- Semantic debate: framing the harm as “ID theft” puts the burden on individuals, while calling it “bank fraud” puts it on the institutions that accepted the fraudulent transaction.
What the dataset actually contains
- Clarification that the dataset is primarily (text, URL) pairs, i.e., links to personal data rather than the files themselves (see the sketch after this list).
- Some argue this is a legal and practical distinction (takedowns, CSAM liability); others see it as a distinction without a difference for training and privacy harm.
- Debate over whether URLs themselves count as PII, since they often uniquely identify a person.
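To make the “(text, URL) pairs” point concrete, here is a minimal sketch of takedown handling when a dataset holds pointers rather than files. The record schema, field names, and blocklist are assumptions for illustration, not the dataset’s actual format:

```python
import json

# Hypothetical record shape: the dataset stores a caption and a pointer,
# not the image bytes themselves.
record = {"text": "portrait of Jane Doe, 2014",
          "url": "https://example.com/img/123.jpg"}

blocklist = {"https://example.com/img/123.jpg"}  # URLs subject to takedown

def apply_takedowns(lines, blocklist):
    """Yield only records whose URL is not on the takedown list."""
    for line in lines:
        rec = json.loads(line)
        if rec["url"] not in blocklist:
            yield rec

print(list(apply_takedowns([json.dumps(record)], blocklist)))  # -> []
```

Note the thread’s counterpoint: anyone who already fetched and trained on the linked content is unaffected by such a filter, and the URL itself (often embedding a username or ID) may already be identifying on its own.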