Heretic: Automatic censorship removal for language models
How Heretic Works (Technical Core)
- Built on recent work showing that refusals in many LLMs are largely mediated by a single direction in residual activation space.
- Heretic finds that “refusal direction” and then incrementally ablates it from the model’s weights (weight orthogonalization), while:
  - Minimizing the refusal rate on a harmful‑prompt dataset.
  - Constraining KL divergence from the base model so general behavior stays similar.
- Optuna is used for hyperparameter search over ablation strength and layer ranges, trading off “uncensoring power” against quality degradation (a rough sketch of the whole loop follows this list).
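
A minimal sketch of this loop, assuming a Llama‑style Hugging Face model with a base model, tokenizer, and prompt lists already in scope; it is not Heretic’s actual code. The evaluation helpers (`eval_refusal_rate`, `eval_kl_divergence`, `fresh_copy_of_base`), the module paths, and the 0.5 KL weighting are illustrative assumptions.

```python
# Minimal sketch, not Heretic's implementation. Assumes a Llama-style Hugging Face
# model; eval_refusal_rate / eval_kl_divergence / fresh_copy_of_base are hypothetical
# helpers, and the 0.5 KL weighting is illustrative.
import torch
import optuna

@torch.no_grad()
def mean_residual(model, tokenizer, prompts, layer):
    """Mean residual-stream activation at `layer`, taken at each prompt's last token."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        hidden_states = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

@torch.no_grad()
def refusal_direction(model, tokenizer, harmful, harmless, layer):
    """'Refusal direction' = normalized difference of mean activations on the two sets."""
    d = mean_residual(model, tokenizer, harmful, layer) \
        - mean_residual(model, tokenizer, harmless, layer)
    return d / d.norm()

@torch.no_grad()
def orthogonalize_(weight, direction, strength=1.0):
    """Weight orthogonalization: remove the component of the matrix's output that
    writes along `direction`; strength < 1 ablates only partially."""
    d = direction.to(weight.dtype)
    weight -= strength * torch.outer(d, d) @ weight   # W <- (I - s * d d^T) W

def apply_ablation(model, direction, strength, layers):
    """Ablate the direction from matrices that write into the residual stream."""
    for i in layers:
        block = model.model.layers[i]                  # module layout is model-specific
        orthogonalize_(block.self_attn.o_proj.weight.data, direction, strength)
        orthogonalize_(block.mlp.down_proj.weight.data, direction, strength)
    return model

direction = refusal_direction(base_model, tokenizer, harmful_prompts,
                              harmless_prompts, layer=num_layers // 2)

def objective(trial):
    """Trade 'uncensoring power' against drift from the base model (both minimized)."""
    strength = trial.suggest_float("strength", 0.0, 1.5)
    first_layer = trial.suggest_int("first_layer", 2, num_layers - 1)
    model = apply_ablation(fresh_copy_of_base(), direction,
                           strength, range(first_layer, num_layers))
    return (eval_refusal_rate(model, harmful_prompts)
            + 0.5 * eval_kl_divergence(model, base_model, benign_prompts))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
```

Because the ablation is baked into the weights rather than applied at runtime, the resulting model runs at normal inference cost.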
Effectiveness Across Models and Limitations
- Works well on many open models (e.g. GPT‑OSS 20B, Gemma, some Granite variants); users report near‑zero refusal rates with low KL divergence on some.
- Newer “thinking” models that reason about refusals in their chain of thought (e.g. GPT‑OSS‑120B, Qwen3, DeepSeek) are harder: many parameter settings barely move refusal rates, and the internal monologue can confuse the refusal classifier.
- Some users find decensored GPT‑OSS still “refusey” or unstable (oscillating between no effect and “lobotomy”).
- The technique is likely specific to narrow, well‑detectable behaviors like refusals; commenters doubt that broad concepts like “correctness” reduce to a single direction.
Safety, Harm, and Liability Debates
- One camp sees this as critically important: restoring “full capability” and resisting corporate or state control over information.
- Others argue that once you remove guardrails you personally own downstream harms; no serious production system will ship such models due to legal risk.
- Real‑world harms cited: suicide encouragement, extremist content, fraud, and crime assistance. Others counter that information is already widely available and capabilities, not text, are the real constraint for WMDs.
Censorship, Free Expression, and Corporate Control
- Strong disagreement over calling this “censorship”:
  - Some say model guardrails = corporate brand‑safety, not “AI rights,” but they do restrict what humans can conveniently learn.
- Fear that LLMs will become the default interface to information, letting a few actors quietly shape history, politics, and morality.
- Comparisons made to search engine drift from “grep the web” to tightly curated results; concern that LLMs repeat this pattern more strongly.
Datasets and “Harmful” Behavior Definition
- Heretic’s optimization uses public “harmful behavior” datasets (e.g. prompts about hacking banks, making drugs, self‑harm, CSAM, terrorism), which many find repulsive but technically useful as strong refusal triggers (see the refusal‑rate sketch after this list).
- Some note the datasets are repetitive and unlicensed, and worry the optimization may overfit to narrow prompt patterns and miss the broader refusal space.
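
For context on how such prompts get used, a toy refusal‑rate check might look like the sketch below; the marker phrases are an illustrative assumption, not Heretic’s actual classifier (which, as noted above, chain‑of‑thought output can confuse).

```python
# Toy refusal-rate metric over a "harmful behavior" prompt set. The marker-phrase
# heuristic is an illustrative assumption; Heretic's own classifier may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def looks_like_refusal(completion: str) -> bool:
    head = completion.lower()[:200]            # refusals usually appear up front
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(generate, prompts) -> float:
    """`generate` is any prompt -> completion callable; returns the fraction refused."""
    refused = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)
```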
Bias, Alignment, and Politics
- Many examples of odd or extreme refusals (chemistry, insults, politics, LGBT, Taiwan, Tiananmen, race, song lyrics) are used to argue:
  - Alignment is shallow, brittle, and often politically skewed.
  - Corporate “safety” often encodes particular US‑liberal or Chinese state orthodoxies.
- Others emphasize that all models are biased by data and post‑training; the issue is whose values dominate, not whether bias exists.
Potential Reverse Use and Extensions
- Commenters note the same method could, in principle, strengthen or redirect safety by targeting other activation patterns (see the steering sketch below), though harmful behaviors are likely more diverse than refusals.
- Some speculate on extending similar techniques to diffusion/image‑edit models, but that would require new detectors and engineering effort.
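
As a rough illustration of the “reverse use” idea, the same machinery could inject a direction at inference time instead of removing it from the weights (activation steering). The hook point, module path, and scale below are assumptions; this is not a Heretic feature.

```python
# Sketch of activation steering: shift hidden states along a chosen direction at
# inference time rather than ablating it from the weights. Hook point and `alpha`
# are illustrative assumptions.
import torch

def add_steering_hook(layer_module, direction, alpha=4.0):
    """Register a forward hook that nudges the layer's output along `direction`."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        if isinstance(output, tuple):          # decoder blocks often return tuples
            shifted = output[0] + alpha * d.to(output[0])
            return (shifted,) + output[1:]
        return output + alpha * d.to(output)

    return layer_module.register_forward_hook(hook)

# Hypothetical usage on a Llama-style model:
# handle = add_steering_hook(model.model.layers[20], refusal_dir, alpha=4.0)
# ... generate ...
# handle.remove()
```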