OpenAI says it has evidence DeepSeek used its model to train competitor
Irony and perceived hypocrisy
- Many see it as “thief cries thief”: OpenAI scraped the internet (often in violation of ToS and copyright) to train its models, then complains when someone allegedly trains on its outputs.
- Several argue that training on LLM output is at least as ethical as training on the work of non-consenting human creators; some say it’s more ethical, since it targets a giant company rather than individuals.
Legal, ToS, and copyright debates
- Distinction drawn between:
- Scraping publicly available web data (copyright and ToS issues, but diffuse and hard to enforce), and
- Systematically using a paid API in violation of explicit terms to distill a competing model.
- However, OpenAI’s own fair‑use arguments in the New York Times case (“public data is fair game”) undercut a hard IP stance against DeepSeek.
- Enforcement is questioned: DeepSeek is Chinese, models are mirrored globally, LLM outputs aren’t copyrightable in the US, and remedies beyond cutting API access seem unclear.
How DeepSeek could have used OpenAI
- Two main theories:
- Straightforward API distillation: pay for access, generate large reasoning datasets, then train cheaper models.
- Indirect ingestion: public datasets of ChatGPT conversations (e.g. ShareGPT) or third‑party services that already distilled OpenAI models.
- Some are skeptical that OpenAI has strong evidence beyond “suspicious” traffic patterns; others point to repeated instances of DeepSeek models identifying themselves as “ChatGPT” as suggestive but not conclusive.
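The first theory above — API distillation — is mechanically simple: query a stronger “teacher” model, save its answers, and fine-tune a cheaper model on the pairs. A minimal sketch of the data-collection step, with a stubbed teacher function standing in for a paid API call (the function and JSONL field names here are illustrative, not any provider’s actual schema):

```python
import json

def teacher_model(prompt: str) -> str:
    """Stand-in for a paid-API call to a frontier model.

    A real pipeline would call the provider's chat endpoint here
    and capture the full reasoning trace. Purely illustrative.
    """
    return f"Step 1: restate '{prompt}'. Step 2: answer."

def build_distillation_set(prompts: list[str]) -> list[str]:
    """Collect (prompt, teacher output) pairs as JSONL lines —
    a common supervised fine-tuning format for a student model."""
    lines = []
    for p in prompts:
        record = {"prompt": p, "completion": teacher_model(p)}
        lines.append(json.dumps(record))
    return lines

dataset = build_distillation_set(["What is 2+2?", "Name a prime."])
```

Done at scale (millions of prompts covering reasoning-heavy tasks), this yields exactly the kind of training corpus the distillation theory describes; the indirect-ingestion theory just replaces `teacher_model` with an already-public dump such as ShareGPT.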
Technical significance and skepticism
- DeepSeek’s contributions seen as:
- Massive efficiency gains (Mixture‑of‑Experts, compressed KV cache, cheaper RL reasoning layer),
- A strong open‑weights reasoning model (R1) that is competitive with frontier systems on many tasks.
- Pushback: the cited $5–6M covered only the final training run, not full R&D; output quality is uneven in some domains; and if much of the “reasoning” data derives from o1/4o, the breakthrough partly piggybacks on earlier, far more expensive work.
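The Mixture-of-Experts idea behind the efficiency claims is that each token activates only a few small “expert” networks rather than one huge dense layer, so compute per token scales with the experts chosen, not the total parameter count. A minimal numpy sketch of top-k routing (toy sizes, not DeepSeek's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_EXPERTS, TOP_K = 8, 16, 4, 2  # toy dimensions

# One small two-layer ReLU MLP per expert.
experts = [
    (rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, D)) * 0.1)
    for _ in range(N_EXPERTS)
]
router = rng.standard_normal((D, N_EXPERTS)) * 0.1  # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its TOP_K experts and mix their outputs.

    x: (tokens, D). Only TOP_K of N_EXPERTS run per token — that
    sparsity is where the compute savings come from.
    """
    logits = x @ router                          # (tokens, N_EXPERTS)
    topk = np.argsort(-logits, axis=1)[:, :TOP_K]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                     # softmax over selected experts
        for gate, e in zip(gates, topk[t]):
            w1, w2 = experts[e]
            out[t] += gate * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

y = moe_forward(rng.standard_normal((3, D)))
```

With `TOP_K=2` of 4 experts, each token pays for roughly half the expert compute of a dense layer with the same total parameters; production MoE models push this ratio much further.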
Economic and competitive implications
- If a frontier model can be approximated via API‑driven distillation and clever training, OpenAI’s “compute moat” shrinks dramatically.
- That implies:
- Lower sustainable prices and margins,
- More competition from smaller labs and open‑weights models,
- Pressure on Nvidia’s “sell shovels in a gold rush” narrative, even if GPU demand remains high.
Geopolitics, bans, and “national security”
- Many expect US policymakers and incumbents to frame DeepSeek as a security and IP threat and push for restrictions, TikTok‑style.
- Others argue bans would mostly hurt US competitiveness while the rest of the world adopts cheap Chinese or open‑weights AI.