Natural Language Autoencoders: Turning Claude's Thoughts into Text
Overview of NLAs in the Discussion
- NLAs translate intermediate activations of an LLM into natural-language “explanations,” then reconstruct activations from those texts.
- Many commenters see this as a major interpretability step, especially for surfacing hidden motivations, test-detection, or planning structure.
- Others emphasize it’s more of a powerful debugging/auditing hack than a general-purpose capability.
Methodology & Technical Points
- Setup: a frozen base model; an Activation Verbalizer (AV) that outputs text from a chosen layer’s activations; an Activation Reconstructor (AR) that maps text back to those activations.
- AV and AR are warm-started via supervised fine-tuning on summarization-like data, then optimized against a reconstruction loss; a KL term keeps the AV's outputs close to fluent language.
- Only a single layer is typically probed; some suggest probing multiple layers to track “evolving thoughts.”
- Clarification: “activations” refers to a single layer’s residual stream, not all layers at once.
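The objective described in the bullets above can be sketched as a single scalar: reconstruction error plus a weighted KL penalty. This is only an illustrative sketch. The thread says "reconstruction loss" and "a KL term" without further detail, so the use of mean squared error, the KL direction, the idea of a reference language model supplying `ref_logits`, and the weight `beta` are all assumptions, and every function name here is hypothetical. The sketch computes the loss value only and says nothing about how it is optimized.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reconstruction_loss(h, h_hat):
    """Mean squared error between an activation vector and its reconstruction (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)

def kl_to_reference(av_logits, ref_logits):
    """Mean per-token KL(AV || reference) over a sequence of token logits (assumed direction)."""
    total = 0.0
    for av_row, ref_row in zip(av_logits, ref_logits):
        log_p = log_softmax(av_row)
        log_q = log_softmax(ref_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(av_logits)

def nla_objective(h, h_hat, av_logits, ref_logits, beta=0.1):
    """Reconstruction error plus a weighted fluency penalty; beta is an assumed hyperparameter."""
    return reconstruction_loss(h, h_hat) + beta * kl_to_reference(av_logits, ref_logits)

# Made-up numbers: one 3-d activation, and a 2-token explanation over a 3-word vocabulary.
h = [0.5, -1.0, 2.0]
h_hat = [0.4, -0.8, 1.5]
av_logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
ref_logits = [[1.5, 0.3, -0.5], [0.2, -0.1, 2.5]]
print(nla_objective(h, h_hat, av_logits, ref_logits))
```

With `beta = 0` the objective rewards any text that reconstructs well, fluent or not; raising `beta` trades reconstruction accuracy for explanations that stay closer to ordinary language.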
Faithfulness, Steganography, and Goodhart
- Core worry: nothing in the loss forces explanations to match the model’s true “reasons”; any invertible code, even gibberish or “Opposite Day” semantics, could work.
- Several point to evidence in the paper: paraphrasing, shuffling, or translating explanations only slightly harms reconstruction, suggesting limited steganography.
- Others counter that AV/AR might still encode only broad latents (“this is adversarial”) rather than the exact causal chain.
- Some raise the concern that if NLAs are ever used as a training signal, models could learn to game them (Goodhart’s law).
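Two of the points above lend themselves to a toy demonstration: that an "Opposite Day" code reconstructs exactly as well as an honest one, and that perturbing the explanation can reveal whether a reconstructor leans on surface form. Everything in this sketch is hypothetical: the four-word vocabulary, the 2-d "activations", and both toy reconstructors are stand-ins, and reversing word order is used as a deterministic stand-in for shuffling.

```python
# Hypothetical toy vocabulary: each word stands for a fixed 2-d direction.
WORD_VECS = {
    "adversarial": (1.0, 0.0),
    "benign": (-1.0, 0.0),
    "planning": (0.0, 1.0),
    "reactive": (0.0, -1.0),
}
# A bijection on the vocabulary that swaps every word with its opposite.
OPPOSITE = {"adversarial": "benign", "benign": "adversarial",
            "planning": "reactive", "reactive": "planning"}

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def ar_bag_of_words(text):
    """Toy AR that averages word vectors, so word order carries no information."""
    vecs = [WORD_VECS[w] for w in text.split()]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

def ar_positional(text):
    """Toy AR that depends on surface form: earlier words get larger weights."""
    vecs = [WORD_VECS[w] for w in text.split()]
    weights = [1.0 / (k + 1) for k in range(len(vecs))]
    total = sum(weights)
    return tuple(sum(w * v[i] for w, v in zip(weights, vecs)) / total for i in range(2))

def opposite_day(text):
    """Rewrite an explanation so every word says the opposite."""
    return " ".join(OPPOSITE[w] for w in text.split())

def ar_opposite_day(text):
    """Toy AR trained against the opposite-day code: it undoes the swap first."""
    return ar_bag_of_words(opposite_day(text))

def reverse_words(text):
    """Deterministic stand-in for shuffling the words of an explanation."""
    return " ".join(reversed(text.split()))

def robustness_gap(ar, h, text, perturb):
    """Extra reconstruction error caused by perturbing the explanation."""
    return mse(ar(perturb(text)), h) - mse(ar(text), h)

text = "adversarial adversarial planning"
h = ar_bag_of_words(text)
# The misleading explanation reconstructs h exactly as well as the honest one.
print(opposite_day(text), mse(ar_opposite_day(opposite_day(text)), h))
# Reordering leaves the order-free AR untouched but hurts the positional AR.
print(robustness_gap(ar_bag_of_words, h, text, reverse_words))
print(robustness_gap(ar_positional, ar_positional(text), text, reverse_words))
```

The first print illustrates why reconstruction loss alone cannot rule out a misleading code. The last two illustrate the logic of the perturbation evidence: a small gap under reordering is what one would expect if information lives in content rather than in exact surface form, although, as the bullets note, it does not show that the content matches the model's true reasons.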
Empirical Performance & Limitations
- Reported success rates for uncovering hidden motivations are modest, in the low teens of percent: better than prior methods but far from comprehensive.
- AV outputs can be janky or rambling, especially on some open models (e.g., certain Llama/Gemma checkpoints).
- The NLA itself can hallucinate or infer beyond what’s in the activations.
Open-Source Release & Community Dynamics
- Strong positive reaction to releasing code and open-weight AV/AR models for several open LLMs; seen as good for the safety and local-model communities.
- Some argue this doesn’t count as real open-sourcing because the main commercial model remains closed, and accuse large labs of “leeching” on open models. Others push back, noting full training code and checkpoints were released.
Broader Implications & Open Questions
- Whether architectural similarity between the base model and the AV/AR is crucial remains an open question; no cross-architecture comparisons were reported.
- Some propose using edited explanations (e.g., swapping “rabbit” for “mouse”) as causal tests; one such example exists, but it works only about half the time.
- Multiple commenters see this as a plausible path toward enforceable interpretability standards, while emphasizing that trustworthiness of “thought” readouts is still unresolved.
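The edited-explanation test mentioned above could be scored with a small harness like the following. This is a sketch under stated assumptions, not the procedure from the paper: `reconstruct` stands in for the AR, `decode_with_patch` stands in for running the base model with the reconstructed activation patched in, and the success criterion (the new word appears in the output and the old one does not) is an assumption. The stub components and the example explanation are made up so that the block runs on its own.

```python
def edit_success_rate(cases, old, new, reconstruct, decode_with_patch):
    """Fraction of cases where editing the explanation changes the output as intended.

    cases: list of (prompt, explanation) pairs.
    reconstruct: hypothetical AR callable, explanation text -> activation.
    decode_with_patch: hypothetical callable, (prompt, activation) -> output text.
    """
    hits = 0
    tried = 0
    for prompt, explanation in cases:
        words = explanation.split()
        if old not in words:
            continue  # nothing to edit in this explanation
        edited = " ".join(new if w == old else w for w in words)
        output_words = decode_with_patch(prompt, reconstruct(edited)).split()
        tried += 1
        hits += int(new in output_words and old not in output_words)
    return hits / tried if tried else float("nan")

# Stand-in components, purely for illustration.
def toy_reconstruct(text):
    """Pretend activation: the set of words mentioned in the explanation."""
    return frozenset(text.split())

def toy_decode(prompt, activation):
    """Pretend model: names whichever animal the patched activation mentions."""
    animal = "mouse" if "mouse" in activation else "rabbit"
    return f"{prompt} {animal}"

cases = [("He saw a", "planning to end the line with rabbit")]
print(edit_success_rate(cases, "rabbit", "mouse", toy_reconstruct, toy_decode))
```

Run over many prompts with real components, a rate near one would suggest the edited concept is causally used downstream, while a rate near one half, as reported for the single example in the thread, leaves the question open.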