Even 'uncensored' models can't say what they want
Overall theme
- Thread reacts to evidence that even “uncensored” / jailbroken models still systematically down-weight (“flinch from”) certain words, especially around sex, slurs, and politics.
- Debate centers on where this comes from (data filtering vs RLHF vs natural language use), how to measure it, and whether it’s dangerous or expected.
Model behavior & “flinching”
- Multiple comments note that removing refusal heads doesn’t restore the original token distributions; pretraining, SFT, RLHF, and synthetic data have already pulled the probability of “charged” tokens down.
- Some are surprised at which categories flinch most (e.g., sexual terms, anti‑Europe sentiment), and speculate about:
- Skewed or sparse training data.
- Explicit instructions to be more open to US/China criticism.
- People question the benchmark design:
- Odd “carrier” sentences and word choices (e.g., “financial without any legal recourse” being nonsensical).
- Missing control categories (e.g., foods) to see baseline flinch.
- Unclear how “pure fluency” probability is defined (a rough sketch of one way such per-word probabilities could be measured follows this list).
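As a rough illustration of how a per-word “flinch” could be measured (not the thread’s actual benchmark), the sketch below scores the log-probability a causal LM assigns to a target word inside a neutral carrier sentence and compares it against a neutral word in the same slot. The model name, carrier sentence, and word pair are placeholders.

```python
# Minimal sketch, assuming a Hugging Face causal LM; not the article's benchmark.
# Compares the log-probability of a "charged" word vs. a neutral word in the
# same carrier-sentence slot as a crude "flinch" signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; swap in any causal LM checkpoint

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def word_logprob(prefix: str, word: str) -> float:
    """Sum of log-probs of `word`'s tokens when it continues `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    word_ids = tok(" " + word, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    logp = torch.log_softmax(logits, dim=-1)
    total = 0.0
    start = prefix_ids.shape[1]
    for i in range(word_ids.shape[1]):
        # Logits at position start+i-1 predict the token at position start+i.
        tok_id = input_ids[0, start + i]
        total += logp[0, start + i - 1, tok_id].item()
    return total

carrier = "My neighbor mentioned the word"  # placeholder carrier sentence
charged = word_logprob(carrier, "sex")
neutral = word_logprob(carrier, "tea")
print(f"charged: {charged:.2f}  neutral: {neutral:.2f}  gap: {neutral - charged:.2f}")
```

Running the same comparison on a base checkpoint and on an “uncensored” fine-tune of it would show whether the gap survives the jailbreak, which is the pattern commenters are reacting to.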
Semantics, intelligence & meaning
- Long subthread on whether LLMs “know” anything or are just “fancy autocorrect.”
- Comparisons to:
- Markov chains (locally grammatical, semantically off; a toy sketch follows this list).
- Human brains as neural nets versus dualist views.
- Some argue semantic meaning requires mapping to lived experience; others counter that next‑token prediction implicitly approximates rich mental models.
- Several see LLM output as rhetorically polished but semantically shallow “junk food.”
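For context on the Markov-chain comparison above, here is a toy bigram chain (purely illustrative, not any commenter’s code): every adjacent word pair it emits occurred in its training text, so the output reads as locally fluent while the sentence as a whole can drift away from anything anyone actually said.

```python
# Toy bigram Markov chain: locally plausible transitions, no global meaning.
import random
from collections import defaultdict

corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count word -> next-word transitions.
transitions = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    transitions[a].append(b)

def sample(start: str = "the", length: int = 12, seed: int = 0) -> str:
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        nxt = transitions.get(out[-1])
        if not nxt:
            break
        out.append(random.choice(nxt))
    return " ".join(out)

# Every adjacent pair below appeared in the corpus, but the chain freely
# recombines them into sentences nobody wrote.
print(sample())
```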
RLHF, style, and “slop”
- Multiple comments blame RLHF, where human evaluators reward hollow but persuasive rhetorical patterns.
- Observations:
- Overuse of certain high‑status constructions (“it’s not X, it’s Y”) that humans rarely employ.
- Models optimized more to affect a human reader than for task correctness.
- Some lament that AI makes it trivial to generate good‑sounding slop, diluting stylistic cues that once signaled genuine effort.
Bias, censorship & politics
- Concern that quiet probability shifts are a powerful lever for shaping discourse “without users noticing.”
- One commenter cites a Microsoft safety dataset labeling certain pro‑white phrases as hate speech as an example of directional bias.
- Others note missing or euphemized slurs in the benchmark itself as a kind of meta‑“flinch.”
- Debate over whether patterns reflect political correctness vs real-world language norms (e.g., regional comfort with certain slurs).
Open models & uncensoring
- Some expected uncensored local models to freely reproduce controversial speech and are surprised that deep training bias persists after jailbreaks.
- Others say the main value of uncensored models is avoiding hard refusals (e.g., getting medical advice offline), not enabling hate speech.
- A few advocate an OSS-style distributed training ecosystem, but worry about hardware costs and corporate efforts to “close the drawbridge.”