Uncensor any LLM with abliteration
Interpretation of Asimov’s Three Laws
- Long subthread debates whether the Three Laws were intended as parody, satire, or sincere optimism.
- Consensus: “parody” is too strong; stories show the laws as flawed and tension-generating rather than purely positive.
- Some argue you can’t assert authorial intent without direct evidence; others say artistic interpretation is inherently plural.
What “Abliteration” Does
- Seen as a specific form of “representation engineering”: identify a “refusal direction” in internal activations and project it out.
- Two modes discussed:
  - Inference-time intervention.
  - Permanent weight orthogonalization.
- Parallels drawn to control/steering vectors; framed as “brain-chipping” models without full retraining.
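The two modes above can be sketched on synthetic data. This is a minimal illustration, not the implementation discussed in the thread: it assumes the "refusal direction" is estimated as the normalized difference between mean activations on refused and non-refused prompts, and every name here (`refusal_direction`, `ablate_activation`, `orthogonalize_weights`, the toy arrays) is hypothetical. A real model would need activations captured from a chosen layer and the edit applied to each matrix that writes into the residual stream.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: the direction is estimated by difference of means.
    Inputs have shape [n_prompts, d_model].
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_activation(x, r):
    """Inference-time mode: subtract the component of each row of x along r."""
    return x - np.outer(x @ r, r)

def orthogonalize_weights(W, r):
    """Weight mode: W' = (I - r r^T) W, so W' @ h never writes along r.

    W has shape [d_model, d_in] and maps an input h to a d_model vector.
    """
    return W - np.outer(r, r @ W)

# Synthetic stand-ins for captured activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 8
planted = np.zeros(d_model)
planted[0] = 1.0  # pretend "refusal" lives along the first axis
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 4.0 * planted

r = refusal_direction(harmful, harmless)

# Mode 1: project the direction out of activations at inference time.
ablated = ablate_activation(harmful, r)

# Mode 2: bake the same projection into a weight matrix once.
W = rng.normal(size=(d_model, 16))
W_abl = orthogonalize_weights(W, r)

print("max |activation . r| after ablation:", np.abs(ablated @ r).max())
print("max |r^T W'| after orthogonalization:", np.abs(r @ W_abl).max())
```

Both printed values should be zero up to floating-point error. The weight edit is equivalent to applying the activation edit to every output of `W`, which is why it needs no hook at inference time.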
Effectiveness and Model Quality
- Mixed empirical reports:
  - Some users say abliterated Llama 3/Qwen2 behave like normal instruct models but without refusals, including on extreme content.
  - Others report “lobotomy”-like degradation: increased perplexity, broken stop tokens, self-talk, and lower overall quality.
  - One commenter with benchmarks claims negligible capability loss when done carefully; suggests some failures are implementation errors.
Open Weights, Safety, and PR
- Strong view that this only matters when weights are downloadable; hosted APIs can’t be modified this way.
- Several argue this is exactly why large vendors avoid releasing weights: once out, “safety” alignment can be stripped trivially.
- Many emphasize that corporate “safety” is mostly brand and legal risk management, not global risk minimization.
Censorship, Free Speech, and Use Policy
- Sharp split between:
  - Those who see uncensoring as obviously dangerous (bombs, bioweapons, CSAM-like content, election manipulation, harassment at scale).
  - Those who see LLMs as glorified search/BS generators and think censorship is paternalistic, inconsistent with free-speech norms, and largely ineffective given Google/books.
- Meta’s Llama 3 license and its prohibition on “allowing misuse” is discussed; some treat it as a real contractual risk, others as practically unenforceable.
- Some argue models trained on public web data have weak moral authority to impose strong downstream ToS.
Jailbreaks vs Weight Editing
- Multiple people note prompt-based jailbreaks already bypass refusals (e.g., “legal department” or safety-framed prompts), so abliteration mainly lowers the barrier.
- Others counter that making uncensoring require technical skill is still useful harm reduction; current safety is “too easy to jailbreak.”
Wider LLM Safety Frictions
- Many concrete frustrations with overbroad refusals (regex for slur filtering, AWS/Gemini refusing basic auth or nuclear engineering questions).
- Some argue safety should be implemented as a separate output filter, not baked into the core model.
- Concern raised that as models become central infrastructure (education, policy, finance), having a few corporations define global speech norms is itself a major risk.
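The suggestion in this section that safety be a separate output filter, rather than baked into the core model, might look roughly like the wrapper below. It is a sketch under stated assumptions: `FilteredModel`, `toy_generate`, and `toy_policy` are hypothetical names, the keyword check merely stands in for whatever classifier a deployment would actually use, and no real model or moderation API is called.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FilteredModel:
    """Wraps an unmodified generator with an independent policy check."""
    generate: Callable[[str], str]          # the core model, left untouched
    is_allowed: Callable[[str, str], bool]  # policy over (prompt, draft)
    refusal_message: str = "[response withheld by output filter]"

    def __call__(self, prompt: str) -> str:
        draft = self.generate(prompt)
        # The policy decision happens outside the model, so it can be
        # swapped or audited without retraining or editing weights.
        if self.is_allowed(prompt, draft):
            return draft
        return self.refusal_message

# Toy stand-ins (hypothetical): an echo "model" and a keyword policy.
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"

def toy_policy(prompt: str, draft: str) -> bool:
    return "forbidden" not in draft.lower()

model = FilteredModel(toy_generate, toy_policy)
print(model("hello"))
print(model("something FORBIDDEN"))
```

The design point is separation of concerns: the generator keeps its full capability, and the deployer, not the model weights, carries the policy.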