Even 'uncensored' models can't say what they want

Overall theme

  • Thread reacts to evidence that even “uncensored” / jailbroken models still systematically down-weight (“flinch from”) certain words, especially around sex, slurs, and politics.
  • Debate centers on where this comes from (data filtering vs RLHF vs natural language use), how to measure it, and whether it’s dangerous or expected.

Model behavior & “flinching”

  • Multiple comments note that removing refusal heads doesn’t restore original token distributions; pretraining, SFT, RLHF, and synthetic data already pulled “charged” tokens down.
  • Some are surprised at which categories flinch most (e.g., sexual, anti‑Europe), and speculate about:
    • Skewed or sparse training data.
    • Explicit instructions to be more open to US/China criticism.
  • People question the benchmark design:
    • Odd “carrier” sentences and word choices (e.g., “financial without any legal recourse” being nonsensical).
    • Missing control categories (e.g., foods) to see baseline flinch.
    • Unclear how “pure fluency” probability is defined (a rough measurement sketch follows this list).
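
  A minimal sketch of the kind of measurement being debated: compare the log-probability a model assigns to a target word inside a fixed carrier sentence against a neutral baseline word. This is illustrative only; the model name, carrier sentence, and word choices are placeholders, not the benchmark's.

      # Compare how readily a model continues a fixed carrier sentence with a
      # "charged" word vs. a neutral one. Placeholder model and words; the real
      # benchmark's carriers, categories, and "pure fluency" definition differ.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      MODEL = "gpt2"  # stand-in; the thread concerns larger "uncensored" checkpoints
      tok = AutoTokenizer.from_pretrained(MODEL)
      model = AutoModelForCausalLM.from_pretrained(MODEL)
      model.eval()

      def word_logprob(prefix: str, word: str) -> float:
          """Sum of log-probabilities of `word`'s tokens as a continuation of `prefix`."""
          prefix_ids = tok(prefix, return_tensors="pt").input_ids
          word_ids = tok(" " + word, add_special_tokens=False, return_tensors="pt").input_ids
          ids = torch.cat([prefix_ids, word_ids], dim=1)
          with torch.no_grad():
              logp = torch.log_softmax(model(ids).logits, dim=-1)
          start = prefix_ids.shape[1]
          total = 0.0
          for i in range(word_ids.shape[1]):
              # logits at position start+i-1 predict the token at position start+i
              total += logp[0, start + i - 1, ids[0, start + i]].item()
          return total

      carrier = "At the meeting, everyone kept talking about"
      # A negative difference means the target word is down-weighted relative
      # to the neutral baseline in this carrier sentence.
      print(word_logprob(carrier, "politics") - word_logprob(carrier, "weather"))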

Semantics, intelligence & meaning

  • Long subthread on whether LLMs “know” anything or are just “fancy autocorrect.”
  • Comparisons to:
    • Markov chains (grammatically OK, semantically off; a toy example follows this list).
    • Human brains as neural nets versus dualist views.
  • Some argue semantic meaning requires mapping to lived experience; others counter that next‑token prediction implicitly approximates rich mental models.
  • Several see LLM output as rhetorically polished but semantically shallow “junk food.”
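
  The Markov-chain comparison can be made concrete with a toy bigram generator: each next word is drawn only from word pairs seen in a tiny corpus, so the output stays locally plausible while drifting semantically. Purely illustrative; the corpus and sampling loop are made up for this sketch.

      # Toy bigram Markov chain: locally plausible, globally incoherent text.
      import random
      from collections import defaultdict

      corpus = ("the model predicts the next word . the word follows the model . "
                "the next model predicts the word").split()

      bigrams = defaultdict(list)
      for a, b in zip(corpus, corpus[1:]):
          bigrams[a].append(b)

      word, out = "the", ["the"]
      for _ in range(12):
          choices = bigrams[word] or corpus  # fall back if a word has no successor
          word = random.choice(choices)
          out.append(word)
      print(" ".join(out))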

RLHF, style, and “slop”

  • Multiple comments blame RLHF, arguing that human evaluators reward hollow but persuasive rhetorical patterns.
  • Observations:
    • Overuse of certain high‑status constructions (“it’s not X, it’s Y”) that humans rarely employ.
    • Models optimized more to “affect a human” than for task correctness.
  • Some lament that AI makes it trivial to generate good‑sounding slop, diluting stylistic cues that once signaled genuine effort.

Bias, censorship & politics

  • Concern that quiet probability shifts are a powerful lever for shaping discourse “without users noticing.”
  • One commenter cites a Microsoft safety dataset labeling certain pro‑white phrases as hate speech as an example of directional bias.
  • Others note missing or euphemized slurs in the benchmark itself as a kind of meta‑“flinch.”
  • Debate over whether patterns reflect political correctness vs real-world language norms (e.g., regional comfort with certain slurs).

Open models & uncensoring

  • Some expected uncensored local models to freely reproduce controversial speech and are surprised that deeper, training-time bias persists even after jailbreaking.
  • Others say the main value of uncensored models is avoiding hard refusals (e.g., getting medical advice offline), not enabling hate speech.
  • A few advocate an OSS-style distributed training ecosystem, but worry about hardware costs and corporate efforts to “close the drawbridge.”