2024-06-13

Uncensor any LLM with abliteration

Interpretation of Asimov’s Three Laws

Long subthread debates whether the Three Laws were intended as parody, satire, or sincere optimism.
Consensus: “parody” is too strong; stories show the laws as flawed and tension-generating rather than purely positive.
Some argue you can’t assert authorial intent without direct evidence; others say artistic interpretation is inherently plural.

What “Abliteration” Does

Described as a specific form of “representation engineering”: identify a “refusal direction” in the model’s internal activations and project it out.
Two modes discussed:
- Inference-time intervention.
- Permanent weight orthogonalization.
Parallels drawn to control/steering vectors; framed as “brain-chipping” models without full retraining.
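
A minimal sketch of the two modes, using plain Python lists on toy data. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts; the function names, shapes, and numbers are illustrative and not taken from the article or any library.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed recipe)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate_activation(x, r):
    """Inference-time mode: subtract the component of x along unit vector r."""
    dot = sum(xi * ri for xi, ri in zip(x, r))
    return [xi - dot * ri for xi, ri in zip(x, r)]

def orthogonalize_weights(W, r):
    """Weight mode: W' = W - r (r^T W).

    W is a list of rows with shape (d_out, d_in) and r has length d_out,
    so every output of W' has zero component along r.
    """
    d_out, d_in = len(W), len(W[0])
    # r^T W gives one coefficient per input column.
    rTW = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r[i] * rTW[j] for j in range(d_in)] for i in range(d_out)]

def matvec(W, v):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Toy activations (made up): harmful prompts differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
r = refusal_direction(harmful, harmless)

# Inference-time intervention on a single activation vector.
ablated = ablate_activation([5.0, 2.0, 1.0], r)

# Permanent edit of a toy weight matrix that writes into the same space.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
W_orth = orthogonalize_weights(W, r)
out = matvec(W_orth, [1.0, 1.0, 1.0])
```

In a real model the same projection would presumably be applied per layer, either to activations at inference time or to each matrix that writes into the activation space; this sketch only shows the linear algebra.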

Effectiveness and Model Quality

Mixed empirical reports:
- Some users say abliterated Llama 3/Qwen2 behave like normal instruct models but without refusals, including on extreme content.
- Others report “lobotomy”-like degradation: increased perplexity, broken stop tokens, self-talk, and lower overall quality.
One commenter who ran benchmarks reports negligible capability loss when the procedure is done carefully, and suggests some of the reported failures are implementation errors.
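
For readers who want to check the perplexity complaint themselves, a minimal sketch of the metric follows. It assumes you already have per-token natural-log probabilities for the same text from both the base and the abliterated model; the helper name and the numbers are made up for illustration and are not results from the thread.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs for the same four tokens under two models.
base_logprobs = [math.log(0.5)] * 4
edited_logprobs = [math.log(0.25)] * 4

base_ppl = perplexity(base_logprobs)
edited_ppl = perplexity(edited_logprobs)

# A ratio well above 1 on held-out text would be consistent with degradation.
ratio = edited_ppl / base_ppl
```

Comparing both models on the same held-out text keeps tokenization identical, so a difference reflects the weight edit rather than the data.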

Open Weights, Safety, and PR

Strong view that this only matters when weights are downloadable; hosted APIs can’t be modified this way.
Several argue this is exactly why large vendors avoid releasing weights: once out, “safety” alignment can be stripped trivially.
Many emphasize that corporate “safety” is mostly brand and legal risk management, not global risk minimization.

Censorship, Free Speech, and Use Policy

Sharp split between:
- Those who see uncensoring as obviously dangerous (bombs, bioweapons, CSAM-like content, election manipulation, harassment at scale).
- Those who see LLMs as glorified search/BS generators and think censorship is paternalistic, inconsistent with free-speech norms, and largely ineffective given Google/books.
Meta’s Llama 3 license and its prohibition on “allowing misuse” is discussed; some treat it as a real contractual risk, others as practically unenforceable.
Some argue models trained on public web data have weak moral authority to impose strong downstream ToS.

Jailbreaks vs Weight Editing

Multiple people note prompt-based jailbreaks already bypass refusals (e.g., “legal department” or safety-framed prompts), so abliteration mainly lowers the barrier.
Others counter that making uncensoring require technical skill is still useful harm reduction; current safety is “too easy to jailbreak.”

Wider LLM Safety Frictions

Many share concrete frustrations with overbroad refusals (e.g., requests for a regex to filter slurs, or AWS and Gemini assistants refusing questions about basic auth or nuclear engineering).
Some argue safety should be implemented as a separate output filter, not baked into the core model.
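
The separate-filter idea can be sketched as a wrapper around an unmodified generator. Everything here is hypothetical: `toy_model` stands in for any text generator, `toy_policy` for any output classifier, and the keyword check is a placeholder rather than a realistic policy.

```python
from typing import Callable

def make_filtered_generator(
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
    refusal_text: str = "[blocked by output filter]",
) -> Callable[[str], str]:
    """Wrap a generator so the policy runs on its output, not inside the model."""
    def filtered(prompt: str) -> str:
        completion = generate(prompt)
        # The policy sees only the finished completion and can be swapped
        # without touching the model's weights.
        return completion if is_allowed(completion) else refusal_text
    return filtered

def toy_model(prompt: str) -> str:
    """Stand-in for a model call: echoes the prompt."""
    return f"echo: {prompt}"

def toy_policy(text: str) -> bool:
    """Stand-in policy: block any output containing a placeholder keyword."""
    return "forbidden" not in text.lower()

filtered_model = make_filtered_generator(toy_model, toy_policy)
allowed_reply = filtered_model("hello")
blocked_reply = filtered_model("FORBIDDEN topic")
```

The design point commenters raise is that the policy lives outside the model, so it can be audited or replaced per deployment while the core model stays unchanged.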
Concern raised that as models become central infrastructure (education, policy, finance), having a few corporations define global speech norms is itself a major risk.
