OpenAI says it has evidence DeepSeek used its model to train competitor
Irony and perceived hypocrisy
- Many see it as “thief cries thief”: OpenAI scraped the internet (often in violation of ToS and copyright) to train its models, then complains when someone allegedly trains on its outputs.
- Several argue that training on LLM output is at least as ethical as training on the work of non-consenting human creators; some say it’s more ethical, since it targets a giant company rather than individuals.
Legal, ToS, and copyright debates
- Distinction drawn between:
- Scraping publicly available web data (copyright and ToS issues, but diffuse and hard to enforce), and
- Systematically using a paid API in violation of explicit terms to distill a competing model.
- However, OpenAI’s own fair‑use arguments in the New York Times case (“public data is fair game”) undercut a hard IP stance against DeepSeek.
- Enforcement is questioned: DeepSeek is Chinese, models are mirrored globally, LLM outputs aren’t copyrightable in the US, and remedies beyond cutting API access seem unclear.
How DeepSeek could have used OpenAI
- Two main theories:
- Straightforward API distillation: pay for access, generate large reasoning datasets, then train cheaper models.
- Indirect ingestion: public datasets of ChatGPT conversations (e.g. ShareGPT) or third‑party services that already distilled OpenAI models.
- Some are skeptical that OpenAI has strong evidence beyond “suspicious” traffic patterns; others point to repeated instances of DeepSeek models identifying themselves as “ChatGPT” as suggestive but not conclusive.
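The first theory above — API distillation — is mechanically simple: query a stronger “teacher” model, save its answers, and fine-tune a cheaper model on the pairs. A minimal sketch of the data-collection step, with a stubbed teacher function standing in for a paid API call (the function and JSONL field names here are illustrative, not any provider’s actual schema):

```python
import json

def teacher_model(prompt: str) -> str:
    """Stand-in for a paid-API call to a frontier model.

    A real pipeline would call the provider's chat endpoint here
    and capture the full reasoning trace. Purely illustrative.
    """
    return f"Step 1: restate '{prompt}'. Step 2: answer."

def build_distillation_set(prompts: list[str]) -> list[str]:
    """Collect (prompt, teacher output) pairs as JSONL lines —
    a common supervised fine-tuning format for a student model."""
    lines = []
    for p in prompts:
        record = {"prompt": p, "completion": teacher_model(p)}
        lines.append(json.dumps(record))
    return lines

dataset = build_distillation_set(["What is 2+2?", "Name a prime."])
```

Done at scale (millions of prompts covering reasoning-heavy tasks), this yields exactly the kind of training corpus the distillation theory describes; the indirect-ingestion theory just replaces `teacher_model` with an already-public dump such as ShareGPT.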
Technical significance and skepticism
- DeepSeek’s contributions seen as:
- Massive efficiency gains (Mixture‑of‑Experts, compressed KV cache, cheaper RL reasoning layer),
- A strong open‑weights reasoning model (R1) that is competitive with frontier systems on many tasks.
- Pushback: the cited $5–6M covered only the final training run, not full R&D; output quality is uneven in some domains; and if much of the “reasoning” data derives from o1/4o, the breakthrough partly piggybacks on earlier, far more expensive work.
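The Mixture-of-Experts idea behind the efficiency claims is that each token activates only a few small “expert” networks rather than one huge dense layer, so compute per token scales with the experts chosen, not the total parameter count. A minimal numpy sketch of top-k routing (toy sizes, not DeepSeek's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_EXPERTS, TOP_K = 8, 16, 4, 2  # toy dimensions

# One small two-layer ReLU MLP per expert.
experts = [
    (rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, D)) * 0.1)
    for _ in range(N_EXPERTS)
]
router = rng.standard_normal((D, N_EXPERTS)) * 0.1  # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its TOP_K experts and mix their outputs.

    x: (tokens, D). Only TOP_K of N_EXPERTS run per token — that
    sparsity is where the compute savings come from.
    """
    logits = x @ router                          # (tokens, N_EXPERTS)
    topk = np.argsort(-logits, axis=1)[:, :TOP_K]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                     # softmax over selected experts
        for gate, e in zip(gates, topk[t]):
            w1, w2 = experts[e]
            out[t] += gate * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

y = moe_forward(rng.standard_normal((3, D)))
```

With `TOP_K=2` of 4 experts, each token pays for roughly half the expert compute of a dense layer with the same total parameters; production MoE models push this ratio much further.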
Economic and competitive implications
- If a frontier model can be approximated via API‑driven distillation and clever training, OpenAI’s “compute moat” shrinks dramatically.
- That implies:
- Lower sustainable prices and margins,
- More competition from smaller labs and open‑weights models,
- Pressure on Nvidia’s “sell shovels in a gold rush” narrative, even if GPU demand remains high.
Geopolitics, bans, and “national security”
- Many expect US policymakers and incumbents to frame DeepSeek as a security and IP threat and push for restrictions, TikTok‑style.
- Others argue bans would mostly hurt US competitiveness while the rest of the world adopts cheap Chinese or open‑weights AI.