2025-01-29

Microsoft Probing If DeepSeek-Linked Group Improperly Obtained OpenAI Data

Perceived Hypocrisy and Irony

Many see Microsoft/OpenAI’s stance as blatantly hypocritical: mass-scraping the public web (often against sites’ ToS) is framed as “innovation,” but training on OpenAI outputs is suddenly “improper.”
Commenters note OpenAI’s own argument that AI outputs aren’t copyrightable; trying to retroactively treat them as protected IP is viewed as “having it both ways.”
The language of “exfiltration” from a paid API is mocked as scare-terminology for “using the service at scale.”

Law, ToS, and Enforcement Limits

Commenters distinguish copyright claims (weak for AI outputs) from contract/ToS violations (a potential civil breach, not a crime).
Debate on whether website ToS bind scrapers who never explicitly agreed; some argue OpenAI itself is only constrained by copyright, not third‑party ToS.
People doubt any meaningful legal remedy against a Chinese company, especially when model weights are openly released; at best, the US could try to restrict DeepSeek services or US‑based hosts.

US–China Politics and Monopolies

The thread ties this strongly to the US–China tech rivalry: an American “pro‑business” (really pro‑monopoly) administration is expected to weaponize regulation against a cheaper Chinese competitor.
Some predict export controls, app‑store bans, tariffs, or KYC rules meant to keep “frontier models” from benefiting China.
Others argue such policies backfire long‑term, pushing China to self-sufficiency in GPUs and AI.

Distillation, Training Data, and Model Identity

The general consensus is that distilling from another model’s outputs is technically standard and, ToS aside, likely legal as fair use.
People note that vast public ChatGPT transcript datasets (e.g., ShareGPT) already contaminate training data.
DeepSeek and other models sometimes claim to be “ChatGPT”; many attribute this to dataset contamination and weak self‑identity rather than direct weight theft.
A minority speculates about more serious data access (internal OpenAI logs, labeled datasets) but flags this as unproven.

Market Dynamics and Microsoft/OpenAI Strategy

Deep skepticism that Microsoft/OpenAI have any real moat if another lab can match performance cheaply.
Some see this probe as an attempt to create legal uncertainty and scare enterprises away from using DeepSeek, rather than out‑competing on quality or price.
