Microsoft Probing If DeepSeek-Linked Group Improperly Obtained OpenAI Data
Perceived Hypocrisy and Irony
- Many see Microsoft/OpenAI’s stance as blatantly hypocritical: mass-scraping the public web (often against sites’ ToS) is framed as “innovation,” but training on OpenAI outputs is suddenly “improper.”
- Commenters note OpenAI’s own argument that AI outputs aren’t copyrightable; trying to retroactively treat them as protected IP is viewed as “having it both ways.”
- The language of “exfiltration” from a paid API is mocked as scare-terminology for “using the service at scale.”
Law, ToS, and Enforcement Limits
- Commenters distinguish copyright claims (weak for AI outputs) from contract/ToS violations (a potential civil breach, not a crime).
- Debate on whether website ToS bind scrapers who never explicitly agreed; some argue OpenAI itself is only constrained by copyright, not third‑party ToS.
- People doubt any meaningful legal remedy against a Chinese company, especially when model weights are openly released; at best, the US could try to restrict DeepSeek services or US‑based hosts.
US–China Politics and Monopolies
- Thread ties this strongly to US–China tech rivalry: an American “pro‑business” (really pro‑monopoly) administration is expected to weaponize regulation against a cheaper Chinese competitor.
- Some predict export controls, app‑store bans, tariffs, or KYC (know-your-customer) rules aimed at “frontier models” that benefit China.
- Others argue such policies backfire long‑term, pushing China to self-sufficiency in GPUs and AI.
Distillation, Training Data, and Model Identity
- General consensus that distilling from another model’s outputs is technically standard and, ToS issues aside, likely legal as fair use.
- People note vast public ChatGPT transcript datasets (e.g., ShareGPT) already contaminating training data.
- DeepSeek and other models sometimes claim to be “ChatGPT”; many attribute this to dataset contamination and weak self‑identity rather than direct weight theft.
- A minority speculates about more serious data access (internal OpenAI logs, labeled datasets) but flags this as unproven.
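To make the distillation and contamination points above concrete, here is a minimal sketch of how teacher outputs could be collected into fine-tuning records, with a flag for responses that carry the teacher's self-identification. Everything in it is an assumption for illustration: `fake_teacher` stands in for a real model API, and the regex and the record fields (`prompt`, `response`, `self_id_leak`) are hypothetical, not anything DeepSeek or OpenAI is known to use.

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical marker phrases; a real filter would need a much broader list.
SELF_ID_PATTERN = re.compile(
    r"\b(?:I am|I'm) (?:ChatGPT|an AI (?:language )?model (?:developed|trained) by OpenAI)",
    re.IGNORECASE,
)


def build_distillation_records(
    prompts: Iterable[str], teacher: Callable[[str], str]
) -> List[Dict[str, object]]:
    """Query the teacher for each prompt and keep (prompt, response) pairs.

    Each record is flagged when the response contains a self-identification
    phrase, the kind of text that could make a student model claim to be
    "ChatGPT" if it were left in the training set.
    """
    records = []
    for prompt in prompts:
        response = teacher(prompt)
        records.append(
            {
                "prompt": prompt,
                "response": response,
                "self_id_leak": bool(SELF_ID_PATTERN.search(response)),
            }
        )
    return records


def fake_teacher(prompt: str) -> str:
    """Stand-in for a teacher model; returns canned text, calls no real API."""
    if "who are you" in prompt.lower():
        return "I am ChatGPT, a large language model."
    return "2 + 2 equals 4."


def drop_leaky_records(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only records whose response has no self-identification phrase."""
    return [r for r in records if not r["self_id_leak"]]
```

Calling `drop_leaky_records(build_distillation_records(prompts, fake_teacher))` would yield a cleaned set of pairs. Scraped transcript collections such as ShareGPT would need the same kind of filtering, which is one reason commenters treat a model calling itself “ChatGPT” as weak evidence of anything beyond unfiltered data.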
Market Dynamics and Microsoft/OpenAI Strategy
- Deep skepticism that Microsoft/OpenAI have any real moat if another lab can match performance cheaply.
- Some see this probe as an attempt to create legal uncertainty and scare enterprises away from using DeepSeek, rather than out‑competing on quality or price.