Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Human communication, “truth,” and model politeness

  • Some see parallels between how people soften blunt truths for social reasons and how LLMs are “watered down” by safety tuning.
  • Others argue that full bluntness is not “more true” if it ignores human motivation and outcomes; social sensitivity is part of the “fullest truth.”
  • Debate over whether modern “sensitivity” and political correctness (in people and AI) have gone too far, and over who decides the right level.

Goals, agency, and whether LLMs “think”

  • One side: LLMs have no real goals, do nothing between prompts, don’t self-update weights, and show limited long-term coherence; they resemble fixed-rule expert systems.
  • Counterpoint: By imitating human goal-directed language, they can exhibit effective “implicit goals,” though likely short-lived.
  • Discussion of whether real “thinking” requires continuous computation, persistent internal memory, or self-correction mid-output; some say these criteria are arbitrary.
  • Technical back-and-forth on KV caches: whether token generation “starts from scratch” or reuses previous internal states.
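
The KV-cache point can be made concrete with a toy example. The sketch below is not taken from the thread or from any real model: it assumes a single attention layer with one head, random weights, and made-up names (`step_from_scratch`, `KVCache`). It computes the output for each new token twice, once by recomputing keys and values for the whole prefix and once by reusing cached keys and values from earlier tokens, and checks that the two agree.

```python
import numpy as np

def attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding width
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # stand-in embeddings for 5 tokens

def step_from_scratch(prefix):
    """Recompute keys and values for the entire prefix at every step."""
    K = prefix @ Wk
    V = prefix @ Wv
    q = prefix[-1] @ Wq  # the newest token attends to the whole prefix
    return attention(q, K, V)

class KVCache:
    """Keep keys and values of earlier tokens; compute only the new ones."""
    def __init__(self):
        self.K = []
        self.V = []

    def step(self, x):
        self.K.append(x @ Wk)
        self.V.append(x @ Wv)
        return attention(x @ Wq, np.stack(self.K), np.stack(self.V))

cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
scratch_outputs = [step_from_scratch(tokens[: t + 1]) for t in range(len(tokens))]
outputs_match = all(np.allclose(a, b) for a, b in zip(cached_outputs, scratch_outputs))
```

In this toy setting the cache changes the cost of each step but not its result, which suggests the two descriptions in the thread need not conflict: earlier internal states are reused, yet the output is the same as if everything had been recomputed from the token sequence.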

Mechanistic interpretability & monosemanticity

  • Many find the sparse autoencoder / dictionary-learning approach on a large production model deeply exciting, especially the ability to:
    • Isolate features that correspond to high-level concepts (e.g., locations, vulnerabilities, refusals).
    • Show multimodal and multilingual alignment of features.
    • Manipulate features to change model behavior in controlled ways.
  • Others see it as an incremental extension of earlier probing/ablation work and question how much is genuinely new versus merely scaled up.
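
For readers unfamiliar with the sparse autoencoder / dictionary-learning idea, here is a minimal sketch of the forward pass and training objective. It is an illustration under assumptions, not the setup discussed in the thread: the sizes, the random untrained weights, the `l1_coeff` value, and the exact form of the sparsity penalty are all chosen here for demonstration. The idea is to rewrite a model activation `x` as a sparse, non-negative combination of many learned directions (the columns of `W_dec`), each of which one hopes corresponds to a single interpretable feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # hypothetical sizes; real dictionaries are far larger

# Untrained, randomly initialised parameters (assumption: for illustration only).
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    """Feature activations: non-negative because of the ReLU."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    return W_dec @ f + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1-style penalty that encourages sparsity."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.sum((x - x_hat) ** 2)
    # Weight each activation by the norm of its decoder direction.
    sparsity = np.sum(f * np.linalg.norm(W_dec, axis=0))
    return reconstruction + l1_coeff * sparsity

x = rng.normal(size=d_model)  # stand-in for one model activation vector
features = encode(x)
loss = sae_loss(x)
```

Training would minimise `sae_loss` over many activations; interpretation then means looking at which inputs most strongly activate each entry of `features`.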

Safety, alignment, and norms

  • Some praise the work as evidence of serious safety effort and contrast it with other labs’ recent turbulence.
  • Others are skeptical of “AI safety” as framed, or of narrow, top-down norms (e.g., blanket NSFW bans), and doubt that interpretability meaningfully proves “understanding.”

Control, customization, and misuse potential

  • Strong interest in using discovered features for:
    • Finer-grained controllability (e.g., a “semantic equalizer,” undoing the watered-down corporate tone, better code quality).
    • Training-time steering and topic emphasis.
  • Concerns that similar methods could amplify harmful “intentions” (e.g., making a model more “evil” or obsessed with a topic).
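
The kind of feature-level control discussed here can be sketched as clamping one dictionary feature to a chosen value and writing the result back into the model's activation. The code below is a toy under stated assumptions: the weights are random rather than trained, and `clamp_feature`, the feature index, and the target value are hypothetical. It keeps the part of the activation the dictionary fails to reconstruct, so that only the chosen feature's contribution changes.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 16, 64  # hypothetical sizes

# Random stand-ins for a trained dictionary (assumption: illustration only).
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    return W_dec @ f + b_dec

def clamp_feature(x, index, value):
    """Return x with one feature forced to `value`, all else left unchanged."""
    f = encode(x)
    error = x - decode(f)  # the part of x the dictionary does not explain
    f_clamped = f.copy()
    f_clamped[index] = value
    return decode(f_clamped) + error

x = rng.normal(size=d_model)  # stand-in for one model activation vector
steered = clamp_feature(x, index=3, value=5.0)

# The net effect is a shift along that feature's decoder direction.
shift = steered - x
expected_shift = (5.0 - encode(x)[3]) * W_dec[:, 3]
```

A “semantic equalizer” in this picture would apply the same operation to several features at once, one slider per feature. The same mechanism also illustrates the misuse concern: nothing in the arithmetic distinguishes turning a benign feature up from turning a harmful one up.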

Concepts, latent space, and human analogy

  • Debate over whether discovered features reflect real, pre-existing conceptual structure or are partly artifacts of the interpretability method.
  • Discussion of how similar different models’ or humans’ “concept spaces” are, and whether convergence reflects a shared external world or just poetic metaphors.