2024-05-21

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Human communication, “truth,” and model politeness

Some see parallels between how people soften blunt truths for social reasons and how LLMs are “watered down” by safety tuning.
Others argue that full bluntness is not “more true” if it ignores human motivation and outcomes; social sensitivity is part of the “fullest truth.”
Debate over whether modern “sensitivity” and political correctness (in people and AI) have gone too far, and over who decides the right level.

Goals, agency, and whether LLMs “think”

One side: LLMs have no real goals, do nothing between prompts, don’t self-update weights, and show limited long-term coherence; they resemble fixed-rule expert systems.
Counterpoint: By imitating human goal-directed language, they can exhibit effective “implicit goals,” though likely short-lived.
Discussion of whether real “thinking” requires continuous computation, persistent internal memory, or self-correction mid-output; some say these criteria are arbitrary.
Technical back-and-forth on KV caches: whether generating each new token “starts from scratch” or reuses internal states already computed for earlier tokens.
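
A toy sketch may help frame the KV-cache point. It assumes a single attention head in a single layer with random weights (all names here are illustrative, not taken from any real model), and compares recomputing every key and value at each step against computing each one once and caching it. Under causal attention the two routes give the same outputs, so the cache is a way of reusing earlier internal states rather than a different computation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the keys/values seen so far."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def decode_from_scratch(xs, Wq, Wk, Wv):
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs

def decode_with_cache(xs, Wq, Wk, Wv):
    """Compute each key and value once, append it to a cache, and reuse it."""
    key_cache, value_cache, outputs = [], [], []
    for x in xs:
        key_cache.append(matvec(Wk, x))
        value_cache.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), key_cache, value_cache))
    return outputs

random.seed(0)
d = 4  # toy hidden size (assumption)

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # 5 toy token vectors

scratch = decode_from_scratch(xs, Wq, Wk, Wv)
cached = decode_with_cache(xs, Wq, Wk, Wv)
outputs_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(scratch, cached)
    for a, b in zip(row_a, row_b)
)
```

The sketch leaves out multiple layers, multiple heads, and sampling, so it only illustrates the narrow claim about reuse; it does not settle the broader question of whether this counts as persistent "memory."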

Mechanistic interpretability & monosemanticity

Many find the sparse autoencoder / dictionary-learning approach on a large production model deeply exciting, especially the ability to:
- Isolate features that correspond to high-level concepts (e.g., locations, vulnerabilities, refusals).
- Show multimodal and multilingual alignment of features.
- Manipulate features to change model behavior in controlled ways.
Others see it as an incremental extension of earlier probing/ablation work and question how much is genuinely new versus simply scaled up.
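
For readers unfamiliar with the approach, here is a minimal sketch of a sparse autoencoder forward pass and loss. It is a generic textbook form (ReLU encoder, linear decoder, squared reconstruction error plus an L1 sparsity penalty), not the exact objective or scale used in the paper, and the tiny hand-picked weights are purely illustrative.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation into nonnegative features, then reconstruct it."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps most features at zero
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(features)  # features are nonnegative, so this is the L1 norm
    return reconstruction_error + l1_coeff * sparsity

# Toy setup (assumption): 2-dimensional activations, a dictionary of 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one row per feature
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # one column per feature direction
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
active = [i for i, f in enumerate(features) if f > 0]
```

In this toy case only one of the three features fires and the input is reconstructed exactly, which is the behavior the sparsity penalty is meant to encourage: each activation explained by a few dictionary directions that can then be inspected one at a time.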

Safety, alignment, and norms

Some praise the work as evidence of serious safety effort and contrast it with other labs’ recent turbulence.
Others are skeptical of “AI safety” as framed, or of narrow, top-down norms (e.g., blanket NSFW bans), and doubt that interpretability meaningfully proves “understanding.”

Control, customization, and misuse potential

Strong interest in using discovered features for:
- Finer-grained controllability (e.g., a “semantic equalizer,” undoing the watered-down corporate tone, better code quality).
- Training-time steering and topic emphasis.
Concerns that similar methods could amplify harmful “intentions” (e.g., making a model more “evil” or obsessed with a topic).
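
Both the hoped-for control and the feared misuse rest on the same basic operation: nudging an activation along one feature's direction. The sketch below is a simplified stand-in for that idea, assuming the feature's current value and its decoder direction are already known; the function name and numbers are hypothetical, and a real intervention would happen inside a model's forward pass.

```python
def clamp_feature(activation, feature_value, decoder_direction, target_value):
    """Shift an activation along one feature's decoder direction so that the
    feature's contribution moves from feature_value to target_value."""
    delta = target_value - feature_value
    return [a + delta * d for a, d in zip(activation, decoder_direction)]

# Toy values (assumptions): a 2-dimensional activation and a unit feature direction.
activation = [2.0, 1.0]
direction = [1.0, 0.0]
current_value = 2.0

amplified = clamp_feature(activation, current_value, direction, target_value=10.0)
ablated = clamp_feature(activation, current_value, direction, target_value=0.0)
```

The same call turns a feature up or switches it off, which is why the "semantic equalizer" idea and the concern about amplifying unwanted behavior are two sides of one mechanism.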

Concepts, latent space, and human analogy

Debate over whether discovered features reflect real, pre-existing conceptual structure or are partly artifacts of the interpretability method.
Discussion of how similar different models’ or humans’ “concept spaces” are, and whether convergence reflects a shared external world or just poetic metaphors.
