OpenAI Pleads It Can't Make Money Without Using Copyrighted Materials for Free

Context & Process

  • Article is from early 2024 and tied to a UK parliamentary inquiry that has since closed; commenters note that UK political turmoil and delays mean little has progressed since.
  • Some emphasize the title is overstated: OpenAI’s formal position is that training on copyrighted works is already legal and should remain so, not that it needs a special retroactive exemption.

Copyright, Fair Use, and Law

  • Major debate over whether training on copyrighted data is fair use:
    • Pro-training side likens it to search engines, caching, libraries, and humans reading books, arguing that only outputs that reproduce protected text or images matter.
    • Critics argue LLMs are built explicitly to commercialize other people’s works and often act as substitutes, which weighs against fair use, especially when outputs are near-verbatim or “in the style of” specific creators.
  • Some compare to Google Books and web search; others say those products are more limited (snippets, links, non-substitute use) and so are not good precedents.
  • Jurisdiction issues arise (training in one country, use in another; fair use is not universal). Several lawsuits (e.g., by news outlets and music publishers) are cited as evidence this is unresolved.

Ethics, Economics, and Impact on Creators

  • Many characterize current AI as built on “piracy” or uncompensated use of scraped/torrented works; they see it as rich firms monetizing small creators’ labor.
  • Concerns:
    • Direct substitution (e.g., “Frank Miller-style” comics, stock image generators, AI-written books flooding markets).
    • Erosion of incentives to create high-quality content.
    • Cultural degradation from AI spam and hallucinations misattributed to real outlets.
  • Counterviews:
    • People will still follow trusted human voices.
    • LLMs compress and transform their training data; they don’t store perfect copies.
    • Some creators would keep producing regardless of money.

Policy, Reform, and Possible Compromises

  • One camp wants strong enforcement: pay for training data or “don’t exist as a business.”
  • Another fears that banning training on copyrighted works would over-expand copyright, hurt open models, and slow Western AI relative to less IP-respecting states.
  • Proposed alternatives:
    • Systematic licensing or revenue sharing for creators (various splits suggested).
    • Focus on sample-efficient models and truly copyright-free/public datasets.
    • Limit outputs (prevent verbatim or substitutive content) rather than training itself.
    • Broader copyright reform: shorter terms, fees to renew, healthier public domain.

Broader Reflections

  • Some see this as proof AI should be treated as a public good, not proprietary infrastructure.
  • Others doubt LLMs’ long-term importance and warn of rewriting copyright for a technology that might be a dead end.