OpenAI says it has evidence DeepSeek used its model to train competitor

Irony and perceived hypocrisy

  • Many see it as “thief cries thief”: OpenAI scraped the internet (often against ToS and copyrights) to train its models, then complains when someone allegedly trains on its outputs.
  • Several argue training on LLM output is at least as ethical as training on the work of non-consenting human creators; some say it is more ethical, since it targets a giant company rather than individuals.

Legal, ToS, and copyright debates

  • Distinction drawn between:
    • Scraping publicly available web data (copyright and ToS issues, but diffuse and hard to enforce), and
    • Systematically using a paid API in violation of explicit terms to distill a competing model.
  • However, OpenAI’s own fair‑use arguments in the New York Times case (“public data is fair game”) undercut a hard IP stance against DeepSeek.
  • Enforcement is questioned: DeepSeek is Chinese, models are mirrored globally, LLM outputs aren’t copyrightable in the US, and remedies beyond cutting API access seem unclear.

How DeepSeek could have used OpenAI

  • Two main theories:
    • Straightforward API distillation: pay for access, generate large reasoning datasets, then train cheaper models.
    • Indirect ingestion: public datasets of ChatGPT conversations (e.g. ShareGPT) or third‑party services that already distilled OpenAI models.
  • Some are skeptical OpenAI has strong evidence beyond “suspicious” traffic; others note repeated instances of models calling themselves “ChatGPT” as suggestive but not conclusive.
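The "straightforward API distillation" theory boils down to a simple pipeline: query the stronger model at scale, save the (prompt, output) pairs, and fine-tune a cheaper student on them. Here is a minimal sketch of that data-collection step in Python; `query_teacher` is a hypothetical stand-in for a real paid-API call, and nothing here represents DeepSeek's actual pipeline.

```python
def query_teacher(prompt: str) -> str:
    """Placeholder for a paid-API call that returns a reasoning trace.

    In a real distillation setup this would hit a hosted model's API;
    here it is stubbed so the sketch is self-contained.
    """
    return f"Step-by-step answer to: {prompt}"


def build_distillation_set(prompts: list[str]) -> list[dict]:
    """Collect (prompt, completion) pairs for supervised fine-tuning.

    Each record becomes one training example for the student model.
    """
    return [{"prompt": p, "completion": query_teacher(p)} for p in prompts]


dataset = build_distillation_set(
    ["What is 2 + 2?", "Prove that sqrt(2) is irrational."]
)
```

The resulting dataset would then feed an ordinary supervised fine-tuning run; the "indirect ingestion" theory is the same pipeline with the API step replaced by scraping public conversation dumps like ShareGPT.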

Technical significance and skepticism

  • DeepSeek’s contributions seen as:
    • Massive efficiency gains (Mixture‑of‑Experts, compressed KV cache, cheaper RL reasoning layer),
    • A strong open‑weights reasoning model (R1) that is competitive with frontier systems on many tasks.
  • Pushback: $5–6M was only the final run, not full R&D; quality is uneven in some domains; and if much of the “reasoning” is derived from o1/4o, the breakthrough is partly piggybacking on earlier expensive work.
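The Mixture-of-Experts efficiency claim rests on a simple idea: a router sends each token to only a few "expert" sub-networks, so most of the model's parameters sit idle on any given forward pass. The toy sketch below (pure Python, with experts reduced to scalar functions) illustrates top-k routing; it is an illustration of the general technique, not DeepSeek's architecture.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token: float, experts: list, router_logits: list[float],
                top_k: int = 2) -> float:
    """Route a token to its top-k experts only.

    Experts outside the top-k are never evaluated -- that skipped
    compute is where the efficiency gain comes from.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i],
                 reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over chosen experts
    return sum(probs[i] / norm * experts[i](token) for i in top)


# Toy experts: each just scales its input by a fixed factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = moe_forward(10.0, experts, router_logits=[0.1, 2.0, 0.2, 1.5])
# Only experts 1 and 3 run; out ≈ 27.55
```

With 4 experts and top_k = 2, half the expert compute is skipped per token; production MoE models push that ratio much further, which is how per-token cost drops even as total parameter count grows.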

Economic and competitive implications

  • If a frontier model can be approximated via API‑driven distillation and clever training, OpenAI’s “compute moat” shrinks dramatically.
  • That implies:
    • Lower sustainable prices and margins,
    • More competition from smaller labs and open‑weights models,
    • Pressure on Nvidia’s “sell shovels in a gold rush” narrative, even if GPU demand remains high.

Geopolitics, bans, and “national security”

  • Many expect US policymakers and incumbents to frame DeepSeek as a security and IP threat and push for restrictions, TikTok‑style.
  • Others argue bans would mostly hurt US competitiveness while the rest of the world adopts cheap Chinese or open‑weights AI.