An embarrassingly simple approach to recover unlearned knowledge for LLMs
Overview of the result
- Paper claims: model “unlearning” is often implemented as small weight updates that suppress specific knowledge while preserving overall performance.
- Discussion consensus: quantization can effectively erase those tiny weight deltas, making the “forgotten” knowledge accessible again in the quantized model (a toy numerical sketch follows this list).
- Several commenters liken unlearning to laying a thin layer of censorship over the model rather than erasing the underlying memory; quantization simply strips that layer off.
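A toy numerical sketch of the mechanism commenters describe (the array size, bit width, and noise scale are illustrative assumptions, not the paper's actual setup): if an unlearning update is smaller than half a quantization step, the original and “unlearned” weights round to the same grid points, so quantization recovers the original model almost exactly.

```python
import numpy as np

def quantize_int4(w, scale):
    """Symmetric per-tensor 4-bit quantization followed by dequantization."""
    levels = 7  # int4 grid: integers in [-8, 7]
    q = np.clip(np.round(w / scale * levels), -8, levels)
    return q * scale / levels

rng = np.random.default_rng(0)
w_original = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)

# Model "unlearning" as a tiny weight update, orders of magnitude
# smaller than the quantization step (an assumption mirroring the
# discussion, not a measurement from any real model).
delta = rng.normal(0.0, 1e-3, size=10_000).astype(np.float32)
w_unlearned = w_original + delta

scale = float(np.abs(w_original).max())
q_original = quantize_int4(w_original, scale)
q_unlearned = quantize_int4(w_unlearned, scale)

# Nearly all weights land back on identical grid points, so the
# quantized "unlearned" model is numerically almost the original.
match = np.mean(q_original == q_unlearned)
print(f"identical quantized weights: {match:.1%}")
```

The same arithmetic suggests why some commenters expect other lossy transformations of the weights to have a similar undoing effect.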
Unlearning vs. guardrails
- Distinction:
  - Unlearning = trying to make the model truly forget certain facts via weight changes.
  - Guardrails = instructing the model not to say certain things, while the knowledge remains.
- Multiple commenters argue that most current “unlearning” is closer to “guardrails in weights”: it lowers the probability of certain outputs without removing the underlying knowledge (a minimal sketch of one such baseline follows this list).
- From an information-theoretic angle, some argue that if information can be recovered by any process (like quantization or clever prompting), it was never really removed.
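To make the “guardrails in weights” framing concrete, here is a minimal sketch of gradient-ascent unlearning, one common baseline. Everything here is an assumption for illustration, not a detail from the thread: `model` is presumed to be a Hugging Face-style causal LM whose forward pass returns `.logits`, and `forget_batch` is a hypothetical tensor of token ids to forget.

```python
import torch
import torch.nn.functional as F

def gradient_ascent_unlearn_step(model, forget_batch, lr=1e-5):
    """One "unlearning" step: ascend the LM loss on text to be forgotten.

    Assumes a Hugging Face-style causal LM returning an object with
    `.logits`; `forget_batch` is a (batch, seq) tensor of token ids.
    """
    logits = model(forget_batch[:, :-1]).logits
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        forget_batch[:, 1:].reshape(-1),
    )
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                # Ascent (+=) pushes probability mass away from the
                # forget text. The step must stay tiny to preserve
                # general performance, which is exactly what leaves
                # the deltas small enough for quantization to round away.
                p += lr * p.grad
    model.zero_grad()
```

Nothing in this update deletes a representation; it only shifts output probabilities, which is why commenters describe it as a guardrail baked into the weights.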
Threat models, safety, and misuse
- Concern: if unlearning is fragile, models “cleaned” of harmful or copyrighted content may still leak it via quantization or other transformations.
- Specific risks mentioned: instructions for drugs, poisons, explosives, and other illegal activities.
- Counterpoint: much of this information is already widely available (e.g., manuals, Wikipedia), and regulators often fixate on AI while ignoring existing channels.
- Some expect future “quantization-robust unlearning,” but others think quantization is just one of many ways to undo weak unlearning.
Copyright, data ownership, and ethics
- A long subthread criticizes LLM developers for extracting value from a public good (the internet) without compensating most creators, especially small ones.
- Others compare this to humans, teachers, or encyclopedias learning from and reselling knowledge, arguing the key issue is verbatim copying and IP misuse, not training itself.
- There is disagreement on whether current practices are “theft” or transformative fair use; courts and new laws are seen as inevitable.
Broader AI debates
- Some see this as more evidence that we’re just hacking censorship layers onto “spicy autocomplete,” and that speculative AGI/“superalignment” discourse distracts from present harms.
- Others argue long-term transformative impact of AI is still likely, analogous (positively or negatively) to past overhyped technologies like 3D printing.
Paper quality and language
- One commenter criticizes the English of the preprint; others respond that it’s just an arXiv draft, that the writing is acceptable, and that attacking non-native English is unfair or racist.