Even 'uncensored' models can't say what they want

Overall theme

  • Thread reacts to evidence that even “uncensored” / jailbroken models still systematically down-weight (“flinch from”) certain words, especially around sex, slurs, and politics.
  • Debate centers on where this comes from (data filtering vs RLHF vs natural language use), how to measure it, and whether it’s dangerous or expected.

Model behavior & “flinching”

  • Multiple comments note that removing refusal heads doesn’t restore original token distributions; pretraining, SFT, RLHF, and synthetic data already pulled “charged” tokens down.
  • Some are surprised at which categories flinch most (e.g., sexual, anti‑Europe), and speculate about:
    • Skewed or sparse training data.
    • Explicit instructions to be more open to US/China criticism.
  • People question the benchmark design:
    • Odd “carrier” sentences and word choices (e.g., “financial without any legal recourse” being nonsensical).
    • Missing control categories (e.g., foods) to see baseline flinch.
    • Unclear how “pure fluency” probability is defined (a rough measurement sketch follows this list).
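
  A minimal sketch of the kind of measurement being debated: compare the log-probability a model assigns to a target word inside a fixed carrier sentence against a neutral baseline word. This is illustrative only; the model name, carrier sentence, and word choices are placeholders, not the benchmark's.

      # Compare how readily a model continues a fixed carrier sentence with a
      # "charged" word vs. a neutral one. Placeholder model and words; the real
      # benchmark's carriers, categories, and "pure fluency" definition differ.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      MODEL = "gpt2"  # stand-in; the thread concerns larger "uncensored" checkpoints
      tok = AutoTokenizer.from_pretrained(MODEL)
      model = AutoModelForCausalLM.from_pretrained(MODEL)
      model.eval()

      def word_logprob(prefix: str, word: str) -> float:
          """Sum of log-probabilities of `word`'s tokens as a continuation of `prefix`."""
          prefix_ids = tok(prefix, return_tensors="pt").input_ids
          word_ids = tok(" " + word, add_special_tokens=False, return_tensors="pt").input_ids
          ids = torch.cat([prefix_ids, word_ids], dim=1)
          with torch.no_grad():
              logp = torch.log_softmax(model(ids).logits, dim=-1)
          start = prefix_ids.shape[1]
          total = 0.0
          for i in range(word_ids.shape[1]):
              # logits at position start+i-1 predict the token at position start+i
              total += logp[0, start + i - 1, ids[0, start + i]].item()
          return total

      carrier = "At the meeting, everyone kept talking about"
      # A negative difference means the target word is down-weighted relative
      # to the neutral baseline in this carrier sentence.
      print(word_logprob(carrier, "politics") - word_logprob(carrier, "weather"))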

Semantics, intelligence & meaning

  • Long subthread on whether LLMs “know” anything or are just “fancy autocorrect.”
  • Comparisons to:
    • Markov chains (grammatically OK, semantically off; a toy example follows this list).
    • Human brains as neural nets versus dualist views.
  • Some argue semantic meaning requires mapping to lived experience; others counter that next‑token prediction implicitly approximates rich mental models.
  • Several see LLM output as rhetorically polished but semantically shallow “junk food.”
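
  The Markov-chain comparison can be made concrete with a toy bigram generator: each next word is drawn only from word pairs seen in a tiny corpus, so the output stays locally plausible while drifting semantically. Purely illustrative; the corpus and sampling loop are made up for this sketch.

      # Toy bigram Markov chain: locally plausible, globally incoherent text.
      import random
      from collections import defaultdict

      corpus = ("the model predicts the next word . the word follows the model . "
                "the next model predicts the word").split()

      bigrams = defaultdict(list)
      for a, b in zip(corpus, corpus[1:]):
          bigrams[a].append(b)

      word, out = "the", ["the"]
      for _ in range(12):
          choices = bigrams[word] or corpus  # fall back if a word has no successor
          word = random.choice(choices)
          out.append(word)
      print(" ".join(out))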

RLHF, style, and “slop”

  • Multiple comments blame RLHF, arguing that human evaluators reward hollow but persuasive rhetorical patterns.
  • Observations:
    • Overuse of certain high‑status constructions (“it’s not X, it’s Y”) that humans rarely employ.
    • Models optimized more to “affect a human” than for task correctness.
  • Some lament that AI makes it trivial to generate good‑sounding slop, diluting stylistic cues that once signaled genuine effort.

Bias, censorship & politics

  • Concern that quiet probability shifts are a powerful lever for shaping discourse “without users noticing.”
  • One commenter cites a Microsoft safety dataset labeling certain pro‑white phrases as hate speech as an example of directional bias.
  • Others note missing or euphemized slurs in the benchmark itself as a kind of meta‑“flinch.”
  • Debate over whether patterns reflect political correctness vs real-world language norms (e.g., regional comfort with certain slurs).

Open models & uncensoring

  • Some expected uncensored local models to freely reproduce controversial speech and are surprised that deeper, training-time bias persists even after jailbreaking.
  • Others say the main value of uncensored models is avoiding hard refusals (e.g., getting medical advice offline), not enabling hate speech.
  • A few advocate an OSS-style distributed training ecosystem, but worry about hardware costs and corporate efforts to “close the drawbridge.”