2024-07-11

GitHub Copilot is not infringing your copyright (2021)

Legal status of training and output

Some argue current EU copyright law explicitly allows scraping and text/data mining of publicly available code, regardless of license, as long as copies aren’t themselves distributed; thus training Copilot is likely legal.
Others counter that exceptions must not “conflict with normal exploitation” or “unreasonably prejudice” rightsholders; they think Copilot does both by laundering licenses and competing with programmers.
Commenters debate whether LLMs/AI are more like:
- A person reading and being inspired by works (often framed as legal), or
- A mass-scale copying machine / scanner or lossy compressor (potentially infringing, especially at corporate/commercial scale).

Copyleft, GPL, and derivative works

One camp insists model weights are derivative works of GPL/copyleft code and must themselves be GPL-licensed, or that generated GPL-like code remains under the GPL.
Others reply that:
- Scraping is legal regardless of license.
- Copyright exceptions (e.g., text/data mining, fair use) can’t be overridden by GPL.
- Only outputs that are substantially similar to specific GPL code would infringe.
Some worry that Copilot lets users unknowingly violate the GPL or other restrictive licenses when it emits recognizable fragments.

Are weights and outputs “derivative works”?

Some see weights as analogous to lossy compression of the training set; if so, outputs based on them should inherit licenses.
Others argue that statistics, n‑gram frequencies, or generative functions derived from works aren’t themselves derivative works under current law.
The line between harmless statistics and models capable of reconstructing large chunks of originals remains unclear.

Ethics, power, and fairness

Many feel that using community code to build a paid product without compensation or attribution is morally wrong, even if technically legal.
Some free‑software advocates emphasize that copyleft was about user freedom, not creators’ revenue, and that shortening copyright terms globally might be preferable to stretching copyright to block AI.

Technical behavior and risk

Multiple comments note that LLMs can and do occasionally regurgitate verbatim code or text; this undermines “it only learns abstractions” defenses.
Others say such cases are rare, often require adversarial prompts, and providers now add filters to reduce verbatim output.
Some organizations reportedly ban generative AI tools over unresolved IP risk.
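
As a rough illustration of the kind of verbatim-output filter mentioned above, here is a minimal sketch based on token n-gram overlap. It is not how Copilot or any particular provider implements filtering; the function names, the whitespace tokenizer, the n-gram size of 6, and the 0.5 threshold are all assumptions chosen for the example.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def token_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of n-token windows in text (naive whitespace tokenizer)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the reference corpus (hypothetical training data)."""
    index: Set[NGram] = set()
    for doc in corpus:
        index |= token_ngrams(doc, n)
    return index


def overlap_ratio(candidate: str, index: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also appear in the index."""
    grams = token_ngrams(candidate, n)
    if not grams:
        # Candidate is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(grams & index) / len(grams)


def should_suppress(candidate: str, index: Set[NGram],
                    n: int = 6, threshold: float = 0.5) -> bool:
    """Flag a suggestion when too much of it matches the corpus verbatim.

    The defaults for n and threshold are illustrative assumptions, not tuned values.
    """
    return overlap_ratio(candidate, index, n) >= threshold


if __name__ == "__main__":
    corpus = ["for i in range ( 10 ) : print ( i )"]
    index = build_index(corpus, n=6)
    print(should_suppress("for i in range ( 10 ) : print ( i )", index))
    print(should_suppress("x = y + z * w - q / r", index))
```

A filter like this only catches near-exact copies. Renamed variables or reordered statements would pass through, which is one reason commenters disagree about how much such filters reduce the risk.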

Trust in the article and institutions

A few commenters question the neutrality of the article’s author due to later employment at GitHub and links to organizations funded by large tech companies.
Others emphasize the author’s prior work on copyright reform as evidence of legal expertise, not corporate loyalty.

Related topics