A major AI training dataset contains millions of examples of personal data
Legal status of LLM training under GDPR and similar laws
- Several commenters argue that no current large LLM provider is truly GDPR-compliant, mainly because:
  - Explicit, purpose-specific consent for training is rarely obtained.
  - GDPR requires the ability to revoke consent and request erasure, which clashes with the lack of effective “machine unlearning,” especially for open-weight models (see the sketch after this list).
- Others note GDPR has “reasonableness / feasibility / state of the art” clauses that may temper strict obligations for LLMs versus, say, social networks.
- Mistral is mentioned as EU-based but seen as opaque about its training data; it is unclear whether it is genuinely compliant.
- Some see GDPR, DSA, AI Act etc. as anti-growth and fear they will drive AI development to China; others counter that tech companies simply haven’t bothered to invest in compliance or ethics.
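One way to make the erasure clash concrete: deleting a person’s records from a training corpus is a simple filter, while deleting their influence from trained weights has no comparable operation. A minimal Python sketch, assuming a hypothetical `subject_id` provenance field (no provider’s actual pipeline is implied):

```python
# Minimal sketch of the asymmetry described above. The "subject_id"
# provenance field is a hypothetical assumption for illustration.

def erase_subject(records: list[dict], subject_id: str) -> list[dict]:
    """Dataset-level erasure: drop every record tied to one data subject."""
    return [r for r in records if r.get("subject_id") != subject_id]

# There is no weight-level equivalent: once a record has influenced
# training, its contribution is diffused across all parameters. Honoring
# erasure there means retraining or approximate "machine unlearning,"
# and already-distributed copies of an open-weight model cannot be recalled.
```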
Enforcement, jurisdiction, and corporate behavior
- Discussion of fines of up to 4% of global annual revenue and data protection authorities’ ability to act without individual lawsuits (see the worked example after this list).
- Historical examples (Clearview, Stability, Meta, Uber, Airbnb) fuel skepticism that enforcement will be strong enough to change behavior; firms may treat fines as a cost of doing business or avoid jurisdictions.
- Concern that if every EU company hosting open-weight models is treated as a data controller, it could chill AI use in the EU.
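For scale, the fine ceiling under discussion is GDPR Art. 83(5): the greater of EUR 20 million or 4% of total worldwide annual turnover. A trivial worked example in Python (the revenue figures are illustrative, not any real company’s):

```python
# GDPR Art. 83(5) caps fines for the most serious infringements at the
# greater of EUR 20 million or 4% of worldwide annual turnover.

def gdpr_max_fine(annual_turnover_eur: float) -> float:
    """Return the statutory maximum fine for a given annual turnover."""
    return max(20_000_000.0, 0.04 * annual_turnover_eur)

print(gdpr_max_fine(50_000_000_000))  # EUR 2 billion cap for a EUR 50B firm
print(gdpr_max_fine(100_000_000))     # EUR 20 million floor applies
```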
Public data, consent, and “victim blaming”
- One side: anything posted publicly (LinkedIn, blogs, image hosts) is effectively fair game; people should know by now the internet is not private.
- Counterpoint: this shifts blame from corporations to individuals; much of the data was:
  - Uploaded by non-technical users who didn’t foresee AI training.
  - Exposed via misconfigurations or platform decisions.
  - Posted about you by others (schools, relatives, companies).
- Debate over whether it’s reasonable to expect people to anticipate genAI use decades later.
Harms, “ID theft,” and accountability
- Strong frustration that data misuse and breaches rarely bring serious consequences.
- Some call for criminal liability for executives, asset seizure, or extremely harsh penalties; others implicitly question feasibility.
- Semantic debate: framing the harm as “ID theft” puts the burden on individuals, while calling it “bank fraud” puts it on the institutions that accepted the fraudulent transaction.
What the dataset actually contains
- Clarification that the dataset is primarily (text, URL) pairs, i.e., links to personal data rather than the files themselves (see the sketch after this list).
- Some argue this is a legal and practical distinction (takedowns, CSAM liability); others see it as a distinction without a difference for training and privacy harm.
- Debate over whether URLs themselves count as PII, since they often uniquely identify a person.
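To make the “(text, URL) pairs” point concrete, here is a minimal sketch of takedown handling when a dataset holds pointers rather than files. The record schema, field names, and blocklist are assumptions for illustration, not the dataset’s actual format:

```python
import json

# Hypothetical record shape: the dataset stores a caption and a pointer,
# not the image bytes themselves.
record = {"text": "portrait of Jane Doe, 2014",
          "url": "https://example.com/img/123.jpg"}

blocklist = {"https://example.com/img/123.jpg"}  # URLs subject to takedown

def apply_takedowns(lines, blocklist):
    """Yield only records whose URL is not on the takedown list."""
    for line in lines:
        rec = json.loads(line)
        if rec["url"] not in blocklist:
            yield rec

print(list(apply_takedowns([json.dumps(record)], blocklist)))  # -> []
```

Note the thread’s counterpoint: anyone who already fetched and trained on the linked content is unaffected by such a filter, and the URL itself (often embedding a username or ID) may already be identifying on its own.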