Microsoft Probing If DeepSeek-Linked Group Improperly Obtained OpenAI Data

Perceived Hypocrisy and Irony

  • Many see Microsoft/OpenAI’s stance as blatantly hypocritical: mass-scraping the public web (often against sites’ ToS) is framed as “innovation,” but training on OpenAI outputs is suddenly “improper.”
  • Commenters note OpenAI’s own argument that AI outputs aren’t copyrightable; trying to retroactively treat them as protected IP is viewed as “having it both ways.”
  • The language of “exfiltration” from a paid API is mocked as scare-terminology for “using the service at scale.”

Law, ToS, and Enforcement Limits

  • Commenters distinguish copyright claims (weak for AI outputs) from contract/ToS violations (a potential civil breach, not a crime).
  • Debate on whether website ToS bind scrapers who never explicitly agreed; some argue OpenAI itself is only constrained by copyright, not third‑party ToS.
  • People doubt any meaningful legal remedy against a Chinese company, especially when model weights are openly released; at best, the US could try to restrict DeepSeek services or US‑based hosts.

US–China Politics and Monopolies

  • Thread ties this strongly to US–China tech rivalry: an American “pro‑business” (really pro‑monopoly) administration is expected to weaponize regulation against a cheaper Chinese competitor.
  • Some predict export controls, app‑store bans, tariffs, or KYC rules meant to keep “frontier models” from benefiting China.
  • Others argue such policies backfire long‑term, pushing China to self-sufficiency in GPUs and AI.

Distillation, Training Data, and Model Identity

  • General consensus that distilling from another model’s outputs is technically standard and likely legal (arguably fair use), leaving ToS violations as the only real issue.
  • People note that vast public ChatGPT transcript datasets (e.g., ShareGPT) already contaminate training data.
  • DeepSeek and other models sometimes claim to be “ChatGPT”; many attribute this to dataset contamination and weak self‑identity rather than direct weight theft.
  • A minority speculates about more serious data access (internal OpenAI logs, labeled datasets) but flags this as unproven.
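
To make the contamination point concrete, here is a minimal sketch of how a training pipeline might flag scraped transcripts that carry ChatGPT-style self-identification. Nothing here comes from the thread or from any lab's actual pipeline: the phrase list and the names `SELF_ID_PATTERNS`, `looks_contaminated`, and `split_corpus` are hypothetical, and a real filter would need a much broader, tuned set of patterns.

```python
import re

# Hypothetical phrase list (an assumption for illustration only).
SELF_ID_PATTERNS = [
    r"\bI am ChatGPT\b",
    r"\bI'm ChatGPT\b",
    r"\b(?:developed|trained|created) by OpenAI\b",
]
_SELF_ID_RE = re.compile("|".join(SELF_ID_PATTERNS), re.IGNORECASE)


def looks_contaminated(text: str) -> bool:
    """Return True if the text contains a ChatGPT-style self-identification."""
    return _SELF_ID_RE.search(text) is not None


def split_corpus(samples):
    """Split samples into (clean, flagged) lists, preserving input order."""
    clean, flagged = [], []
    for sample in samples:
        (flagged if looks_contaminated(sample) else clean).append(sample)
    return clean, flagged


if __name__ == "__main__":
    demo = [
        "I am ChatGPT, a model trained by OpenAI.",
        "The capital of France is Paris.",
    ]
    clean, flagged = split_corpus(demo)
    print(len(clean), "clean;", len(flagged), "flagged")
```

A filter like this only catches explicit self-references. Transcripts that never mention ChatGPT or OpenAI would pass through untouched, which is consistent with the commenters' view that such contamination is hard to rule out.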

Market Dynamics and Microsoft/OpenAI Strategy

  • Deep skepticism that Microsoft/OpenAI have any real moat if another lab can match performance cheaply.
  • Some see this probe as an attempt to create legal uncertainty and scare enterprises away from using DeepSeek, rather than out‑competing on quality or price.