OpenAI Pleads It Can't Make Money Without Using Copyrighted Materials for Free
Context & Process
- Article is from early 2024, tied to a UK parliamentary inquiry that has since closed; commenters note that political turmoil and delays in the UK mean little has progressed since.
- Some emphasize the title is overstated: OpenAI’s formal position is that training on copyrighted works is already legal and should remain so, not that it needs a special retroactive exemption.
Copyright, Fair Use, and Law
- Major debate over whether training on copyrighted data is fair use:
  - Pro-training side likens it to search engines, caching, libraries, and humans reading books, arguing that only outputs that reproduce protected text or images matter.
  - Critics argue LLMs are built explicitly to commercialize other people’s works and often act as substitutes, which weighs against fair use, especially when outputs are near-verbatim or “in the style of” specific creators.
  - Some compare training to Google Books and web search; others say those products are more limited (snippets, links, non-substitute use) and so are not good precedents.
- Jurisdiction issues arise (training in one country, use in another; fair use not universal). Several lawsuits (e.g., news and music) are cited as evidence this is unresolved.
Ethics, Economics, and Impact on Creators
- Many characterize current AI as built on “piracy” or uncompensated use of scraped/torrented works; they see it as rich firms monetizing small creators’ labor.
- Concerns:
  - Direct substitution (e.g., “Frank Miller-style” comics, stock image generators, AI-written books flooding markets).
  - Erosion of incentives to create high-quality content.
  - Cultural degradation from AI spam and from hallucinations misattributed to real outlets.
- Counterviews:
  - People will still follow trusted human voices.
  - LLMs compress and transform data; they don’t store perfect copies.
  - Some creators would keep producing regardless of money.
Policy, Reform, and Possible Compromises
- One camp wants strong enforcement: pay for training data or “don’t exist as a business.”
- Another fears that banning training on copyrighted works would over-expand copyright, hurt open models, and slow Western AI relative to less IP-respecting states.
- Proposed alternatives:
  - Systematic licensing or revenue sharing for creators (various splits suggested).
  - Focus on sample-efficient models and truly copyright-free/public-domain datasets.
  - Limit outputs (prevent verbatim or substitutive content) rather than restricting training itself.
  - Broader copyright reform: shorter terms, renewal fees, a healthier public domain.
Broader Reflections
- Some see this as proof AI should be treated as a public good, not proprietary infrastructure.
- Others doubt LLMs’ long-term importance and warn of rewriting copyright for a technology that might be a dead end.