Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Human communication, “truth,” and model politeness
- Some see parallels between how people soften blunt truths for social reasons and how LLMs are “watered down” by safety tuning.
- Others argue that full bluntness is not “more true” if it ignores human motivation and outcomes; social sensitivity is part of the “fullest truth.”
- Debate over whether modern “sensitivity” and political correctness (in people and AI) has gone too far, and who decides the right level.
Goals, agency, and whether LLMs “think”
- One side: LLMs have no real goals, do nothing between prompts, don’t self-update weights, and show limited long-term coherence; they resemble fixed-rule expert systems.
- Counterpoint: By imitating human goal-directed language, they can exhibit effective “implicit goals,” though likely short-lived.
- Discussion of whether real “thinking” requires continuous computation, persistent internal memory, or self-correction mid-output; some say these criteria are arbitrary.
- Technical back-and-forth on KV caches: whether token generation “starts from scratch” or reuses previous internal states.
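
For readers following the KV-cache point, here is a minimal sketch of what is and is not reused. It assumes a toy single-head causal attention layer; `project`, `CachedDecoder`, and the scale factors are hypothetical stand-ins for learned projections, not any real model's code. In causal attention, the keys and values of earlier tokens do not depend on later tokens, so caching them gives the same result as recomputing the whole prefix.

```python
import math

def project(vector, scale):
    # Hypothetical stand-in for a learned linear projection.
    return [scale * x for x in vector]

def attend(query, keys, values):
    """Softmax attention of one query over all cached key/value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def last_output_from_scratch(tokens):
    """Recompute keys and values for the whole prefix, then attend from the last token."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.5), keys, values)

class CachedDecoder:
    """Keeps keys/values from earlier steps so each step only projects the new token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        self.keys.append(project(token, 0.5))
        self.values.append(project(token, 2.0))
        return attend(project(token, 1.5), self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder = CachedDecoder()
cached_outputs = [decoder.step(t) for t in tokens]
```

In this sketch both readings in the thread hold at once: the cached path reuses earlier internal states, yet those states are a deterministic function of the prefix, so recomputing them from scratch yields the same output.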
Mechanistic interpretability & monosemanticity
- Many find the sparse autoencoder / dictionary-learning approach on a large production model deeply exciting, especially the ability to:
  - Isolate features that correspond to high-level concepts (e.g., locations, vulnerabilities, refusals).
  - Show multimodal and multilingual alignment of features.
  - Manipulate features to change model behavior in controlled ways.
- Others see it as an incremental extension of earlier probing/ablation work and question how much is genuinely new rather than simply scaled up.
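
As a rough illustration of the sparse autoencoder idea discussed above, the sketch below encodes an activation vector into non-negative feature activations, decodes it back, and scores the result with a reconstruction error plus an L1 sparsity penalty. This is a generic textbook form with tiny hand-set weights, assumed for illustration; the function names and numbers are hypothetical and it is not the exact architecture or loss used in the paper.

```python
def encode(activation, w_enc, b_enc):
    """Feature activations: ReLU(W_enc @ activation + b_enc)."""
    return [max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(w_enc, b_enc)]

def decode(features, w_dec, b_dec):
    """Reconstruction: bias plus each feature's activation times its direction."""
    output = list(b_dec)
    for strength, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            output[j] += strength * d
    return output

def sae_loss(activation, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    features = encode(activation, w_enc, b_enc)
    reconstruction = decode(features, w_dec, b_dec)
    error = sum((a - r) ** 2 for a, r in zip(activation, reconstruction))
    sparsity = sum(features)  # features are non-negative, so this is their L1 norm
    return error + l1_coeff * sparsity, features

# Toy dictionary: 3 features over a 2-dimensional activation (hypothetical values).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

loss, features = sae_loss([2.0, 0.0], w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1)
```

Here only the first feature fires for the input, which is the kind of sparse, one-concept-per-feature behavior the approach aims for; in practice the weights are learned and the dictionary is far larger than the activation dimension.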
Safety, alignment, and norms
- Some praise the work as evidence of serious safety effort and contrast it with other labs’ recent turbulence.
- Others are skeptical of “AI safety” as framed, or of narrow, top-down norms (e.g., blanket NSFW bans), and doubt that interpretability meaningfully proves “understanding.”
Control, customization, and misuse potential
- Strong interest in using discovered features for:
  - Finer-grained controllability (e.g., a “semantic equalizer,” undoing the watered-down corporate tone, better code quality).
  - Training-time steering and topic emphasis.
- Concerns that similar methods could amplify harmful “intentions” (e.g., making a model more “evil” or obsessed with a topic).
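
The “semantic equalizer” idea from this part of the thread can be sketched as shifting an activation vector along each feature's direction until that feature reads at a chosen level. This is a toy sketch under stated assumptions: the feature names, directions, and values are invented for illustration, and `equalize` is a hypothetical helper, not an API from the paper or any library.

```python
def equalize(activation, features, targets):
    """Move `activation` so each named feature reads as its target value.

    `features` maps a name to (current_activation, direction);
    `targets` maps a name to the desired activation for that feature.
    """
    steered = list(activation)
    for name, target in targets.items():
        current, direction = features[name]
        delta = target - current  # how far to turn this "slider"
        steered = [s + delta * d for s, d in zip(steered, direction)]
    return steered

# Hypothetical features over a 2-dimensional activation.
features = {
    "corporate_tone": (1.0, [1.0, 0.0]),
    "topic_emphasis": (0.0, [0.0, 1.0]),
}
steered = equalize(
    [1.0, 2.0],
    features,
    {"corporate_tone": 0.0, "topic_emphasis": 3.0},
)
```

The same mechanism cuts both ways, which is the concern raised above: a slider that turns a tone down can equally turn an undesirable feature up.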
Concepts, latent space, and human analogy
- Debate over whether discovered features reflect real, pre-existing conceptual structure or are partly artifacts of the interpretability method.
- Discussion of how similar different models’ or humans’ “concept spaces” are, and whether apparent convergence reflects a shared external world or is only a poetic metaphor.