An embarrassingly simple approach to recover unlearned knowledge for LLMs
Overview of the result
- Paper claims: model “unlearning” is often implemented as small weight updates that suppress specific knowledge while preserving overall performance.
- Discussion consensus: quantization can effectively erase those tiny weight deltas, making the “forgotten” knowledge accessible again in the quantized model (a toy numerical sketch follows this list).
- Several commenters liken unlearning to laying a thin layer of censorship over the model rather than erasing the underlying memory; quantization simply strips that layer off.
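A toy numerical sketch of the mechanism commenters describe (the array size, bit width, and noise scale are illustrative assumptions, not the paper's actual setup): if an unlearning update is smaller than half a quantization step, the original and “unlearned” weights round to the same grid points, so quantization recovers the original model almost exactly.

```python
import numpy as np

def quantize_int4(w, scale):
    """Symmetric per-tensor 4-bit quantization followed by dequantization."""
    levels = 7  # int4 grid: integers in [-8, 7]
    q = np.clip(np.round(w / scale * levels), -8, levels)
    return q * scale / levels

rng = np.random.default_rng(0)
w_original = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)

# Model "unlearning" as a tiny weight update, orders of magnitude
# smaller than the quantization step (an assumption mirroring the
# discussion, not a measurement from any real model).
delta = rng.normal(0.0, 1e-3, size=10_000).astype(np.float32)
w_unlearned = w_original + delta

scale = float(np.abs(w_original).max())
q_original = quantize_int4(w_original, scale)
q_unlearned = quantize_int4(w_unlearned, scale)

# Nearly all weights land back on identical grid points, so the
# quantized "unlearned" model is numerically almost the original.
match = np.mean(q_original == q_unlearned)
print(f"identical quantized weights: {match:.1%}")
```

The same arithmetic suggests why some commenters expect other lossy transformations of the weights to have a similar undoing effect.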
Unlearning vs. guardrails
- Distinction:
  - Unlearning = trying to make the model truly forget certain facts via weight changes.
  - Guardrails = instructing the model not to say certain things, while the knowledge remains.
- Multiple commenters argue that most current “unlearning” is closer to “guardrails in weights”: it lowers the probability of certain outputs without removing the underlying knowledge (a minimal sketch of one such baseline follows this list).
- From an information-theoretic angle, some argue that if information can be recovered by any process (like quantization or clever prompting), it was never really removed.
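To make the “guardrails in weights” framing concrete, here is a minimal sketch of gradient-ascent unlearning, one common baseline. Everything here is an assumption for illustration, not a detail from the thread: `model` is presumed to be a Hugging Face-style causal LM whose forward pass returns `.logits`, and `forget_batch` is a hypothetical tensor of token ids to forget.

```python
import torch
import torch.nn.functional as F

def gradient_ascent_unlearn_step(model, forget_batch, lr=1e-5):
    """One "unlearning" step: ascend the LM loss on text to be forgotten.

    Assumes a Hugging Face-style causal LM returning an object with
    `.logits`; `forget_batch` is a (batch, seq) tensor of token ids.
    """
    logits = model(forget_batch[:, :-1]).logits
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        forget_batch[:, 1:].reshape(-1),
    )
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # Ascent (+=) pushes probability mass away from the
                # forget text. The step must stay tiny to preserve
                # general performance, which is exactly what leaves
                # the deltas small enough for quantization to round away.
                p += lr * p.grad
    model.zero_grad()
```

Nothing in this update deletes a representation; it only shifts output probabilities, which is why commenters describe it as a guardrail baked into the weights.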
Threat models, safety, and misuse
- Concern: if unlearning is fragile, models “cleaned” of harmful or copyrighted content may still leak it via quantization or other transformations.
- Specific risks mentioned: instructions for drugs, poisons, explosives, and other illegal activities.
- Counterpoint: much of this information is already widely available (e.g., manuals, Wikipedia), and regulators often fixate on AI while ignoring existing channels.
- Some expect future “quantization-robust unlearning,” but others think quantization is just one of many ways to undo weak unlearning.
Copyright, data ownership, and ethics
- A long subthread criticizes LLM developers for extracting value from a public good (the internet) without compensating most creators, especially small ones.
- Others compare this to humans, teachers, or encyclopedias learning from and reselling knowledge, arguing the key issue is verbatim copying and IP misuse, not training itself.
- There is disagreement on whether current practices are “theft” or transformative fair use; courts and new laws are seen as inevitable.
Broader AI debates
- Some see this as more evidence that we’re just hacking censorship layers onto “spicy autocomplete,” and that speculative AGI/“superalignment” discourse distracts from present harms.
- Others argue long-term transformative impact of AI is still likely, analogous (positively or negatively) to past overhyped technologies like 3D printing.
Paper quality and language
- One commenter criticizes the English of the preprint; others respond that it’s just an arXiv draft, that the writing is acceptable, and that attacking non-native English is unfair or racist.