A major AI training data set contains millions of examples of personal data

Legal status of LLM training under GDPR and similar laws

  • Several commenters argue that no current large LLM provider is truly GDPR-compliant, mainly because:
    • Explicit, purpose-specific consent for training is rarely obtained.
    • The GDPR requires that consent can be withdrawn and data erased on request, which clashes with the lack of effective “machine unlearning,” especially for open-weight models already in circulation.
  • Others note that the GDPR contains reasonableness, feasibility, and “state of the art” qualifiers that may temper strict obligations for LLMs compared with, say, social networks.
  • Mistral is mentioned as EU-based but seen as opaque about its training data; it is unclear whether the company is genuinely compliant.
  • Some see the GDPR, DSA, AI Act, etc. as anti-growth and fear they will drive AI development to China; others counter that tech companies simply haven’t bothered to invest in compliance or ethics.

Enforcement, jurisdiction, and corporate behavior

  • Discussion of fines of up to 4% of worldwide annual turnover (or €20 million, whichever is higher) and data protection authorities’ power to act without individual lawsuits.
  • Historical examples (Clearview, Stability, Meta, Uber, Airbnb) fuel skepticism that enforcement will be strong enough to change behavior; firms may treat fines as a cost of doing business or avoid the jurisdiction altogether.
  • Concern that if every EU company hosting open-weight models is treated as a data controller, it could chill AI use in the EU.

Public data, consent, and “victim blaming”

  • One side: anything posted publicly (LinkedIn, blogs, image hosts) is effectively fair game; people should know by now that the internet is not private.
  • Counterpoint: this shifts blame from corporations to individuals; much of the affected material is:
    • Uploads by non-technical users who didn’t foresee AI training.
    • Content exposed via misconfiguration or platform decisions.
    • Data about a person posted by others (schools, relatives, companies).
  • Debate over whether it’s reasonable to expect people to have anticipated generative-AI training decades after they posted.

Harms, “ID theft,” and accountability

  • Strong frustration that data misuse and breaches rarely bring serious consequences.
  • Some call for criminal liability for executives, asset seizure, or extremely harsh penalties; others implicitly question whether such measures are feasible.
  • Semantic debate: framing the harm as “ID theft” puts the burden on individuals, while calling it “bank fraud” puts it on the institutions that were actually deceived.

What the dataset actually contains

  • Clarification that the dataset consists primarily of (text, URL) pairs, i.e., links to personal data rather than the files themselves (see the sketch after this list).
  • Some argue this is a meaningful legal and practical distinction (it simplifies takedowns and limits liability for hosting CSAM); others see it as a distinction without a difference for training and privacy harm.
  • Debate over whether URLs themselves count as PII, since a URL often uniquely identifies a person (for example, a profile link containing a username).
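
To make the (text, URL) point concrete, here is a minimal sketch of what records in such a dataset plausibly look like and why the URL field can itself be identifying. The record layout, field names, and example URL are illustrative assumptions, not taken from the actual dataset.

```python
import re
from dataclasses import dataclass

@dataclass
class CaptionUrlRecord:
    """One row of a hypothetical (text, URL) scrape dataset: a caption
    plus a link to the remote file, not the file contents themselves."""
    text: str  # caption / alt text scraped alongside the link
    url: str   # pointer to the image or page, hosted elsewhere

# Hypothetical record; both fields are invented for illustration.
record = CaptionUrlRecord(
    text="Jane Doe at the 2014 company retreat",
    url="https://img.example.com/users/jane.doe/retreat-2014.jpg",
)

# Why a bare URL can still be PII: many hosting schemes embed a
# username or real-name slug directly in the path.
match = re.search(r"/users/([^/]+)/", record.url)
if match:
    print("identifier recoverable from the URL alone:", match.group(1))
```

This also illustrates the takedown argument above: deleting the remote file breaks the link, but the (text, URL) pair, including any name embedded in it, remains in the dataset.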