OpenAI Pleads It Can't Make Money Without Using Copyrighted Materials for Free
Context & Process
- Article is from early 2024, tied to a UK parliamentary inquiry that has since closed; commenters note that political turmoil and delays in the UK mean little has progressed since.
- Some emphasize the title is overstated: OpenAI’s formal position is that training on copyrighted works is already legal and should remain so, not that it needs a special retroactive exemption.
Copyright, Fair Use, and Law
- Major debate over whether training on copyrighted data is fair use:
  - Pro-training side likens it to search engines, caching, libraries, and humans reading books, arguing that only outputs that reproduce protected text or images matter.
  - Critics argue LLMs are built explicitly to commercialize other people’s works and often act as substitutes, which weighs against fair use, especially when outputs are near-verbatim or “in the style of” specific creators.
  - Some compare training to Google Books and web search; others say those products are more limited (snippets, links, non-substitute use) and so are not good precedents.
- Jurisdiction issues arise (training in one country, use in another; fair use not universal). Several lawsuits (e.g., news and music) are cited as evidence this is unresolved.
Ethics, Economics, and Impact on Creators
- Many characterize current AI as built on “piracy” or uncompensated use of scraped/torrented works; they see it as rich firms monetizing small creators’ labor.
- Concerns:
  - Direct substitution (e.g., “Frank Miller-style” comics, stock image generators, AI-written books flooding markets).
  - Erosion of incentives to create high-quality content.
  - Cultural degradation from AI spam and from hallucinations misattributed to real outlets.
- Counterviews:
  - People will still follow trusted human voices.
  - LLMs compress and transform data; they don’t store perfect copies.
  - Some creators would keep producing regardless of money.
Policy, Reform, and Possible Compromises
- One camp wants strong enforcement: pay for training data or “don’t exist as a business.”
- Another fears that banning training on copyrighted works would over-expand copyright, hurt open models, and slow Western AI relative to less IP-respecting states.
- Proposed alternatives:
  - Systematic licensing or revenue sharing for creators (various splits suggested).
  - Focus on sample-efficient models and truly copyright-free/public-domain datasets.
  - Limit outputs (prevent verbatim or substitutive content) rather than restricting training itself.
  - Broader copyright reform: shorter terms, renewal fees, a healthier public domain.
Broader Reflections
- Some see this as proof AI should be treated as a public good, not proprietary infrastructure.
- Others doubt LLMs’ long-term importance and warn of rewriting copyright for a technology that might be a dead end.