2026-05-07

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Overview of NLAs in the Discussion

NLAs translate intermediate activations of an LLM into natural-language “explanations,” then reconstruct activations from those texts.
Many commenters see this as a major interpretability step, especially for surfacing hidden motivations, test-detection, or planning structure.
Others emphasize it’s more of a powerful debugging/auditing hack than a general-purpose capability.

Methodology & Technical Points

Setup: a frozen base model; an Activation Verbalizer (AV) that outputs text from a chosen layer’s activations; an Activation Reconstructor (AR) that maps text back to those activations.
AV and AR are warm-started via supervised fine-tuning on summarization-like data, then optimized with reconstruction loss; a KL term keeps outputs near fluent language.
Only a single layer is typically probed; some suggest probing multiple layers to track “evolving thoughts.”
Clarification: “activations” refers to a specific layer’s residual stream, not all layers at once.
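The objective described above can be sketched roughly as follows. This is a minimal illustration, not the released training code: the use of mean squared error, the direction of the KL term, the reference distribution, and the weight `beta` are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(h, h_hat):
    """Mean squared error between the original and reconstructed activation (assumed metric)."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    return float(np.mean((h - h_hat) ** 2))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def nla_objective(h, h_hat, av_token_probs, ref_token_probs, beta=0.1):
    """Reconstruction error plus a beta-weighted KL penalty.

    av_token_probs / ref_token_probs: per-position next-token distributions from the
    verbalizer and from a fluent reference model (hypothetical inputs). The penalty
    discourages the verbalizer from drifting away from ordinary language.
    """
    kl = np.mean([kl_divergence(p, q) for p, q in zip(av_token_probs, ref_token_probs)])
    return reconstruction_loss(h, h_hat) + beta * float(kl)

# Perfect reconstruction with fluent text gives (numerically) zero loss.
h = [1.0, 2.0, 3.0]
fluent = [[0.5, 0.5]]
perfect = nla_objective(h, h, fluent, fluent)
```

Note that neither term refers to what the explanation means, which is the starting point for the faithfulness concerns raised in the discussion.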

Faithfulness, Steganography, and Goodhart

Core worry: nothing in the loss forces explanations to match the model’s true “reasons”; any invertible code, even gibberish or “Opposite Day” semantics, could work.
Several point to evidence in the paper: paraphrasing, shuffling, or translating explanations only slightly harms reconstruction, suggesting limited steganography.
Others counter that AV/AR might still encode only broad latents (“this is adversarial”) rather than the exact causal chain.
Some worry that if NLAs are ever used as a training signal, models could learn to game them (Goodhart’s law).
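The “Opposite Day” worry can be made concrete with a toy example. The sketch below assumes a one-bit “activation” and two hypothetical codebooks; it is not drawn from the paper. Both codebooks are invertible, so both achieve zero round-trip error, yet one of them says the opposite of what is true.

```python
# Toy illustration only: a 1-bit "activation" and two hypothetical verbalizer codebooks.
HONEST = {0: "the input looks benign", 1: "the input looks adversarial"}
OPPOSITE = {0: "the input looks adversarial", 1: "the input looks benign"}

def make_pair(codebook):
    """Return (verbalize, reconstruct) functions for an invertible codebook."""
    inverse = {text: act for act, text in codebook.items()}
    return codebook.__getitem__, inverse.__getitem__

def round_trip_error(codebook, activations):
    """Total absolute error after activation -> text -> activation."""
    verbalize, reconstruct = make_pair(codebook)
    return sum(abs(a - reconstruct(verbalize(a))) for a in activations)

activations = [0, 1, 1, 0]
honest_error = round_trip_error(HONEST, activations)
opposite_error = round_trip_error(OPPOSITE, activations)
# Both errors are 0: reconstruction loss alone cannot tell the two codebooks apart.
```

This is why commenters treat the paraphrase, shuffle, and translation checks as the relevant evidence: a reconstructor that still works on reworded text is at least relying on meaning that survives rewording rather than on an exact token-level code.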

Empirical Performance & Limitations

Reported success in uncovering hidden motivations is modest (a success rate in the low teens, in percent): better than prior methods but far from comprehensive.
AV outputs can be janky or rambling, especially on some open models (e.g., certain Llama/Gemma checkpoints).
The NLA itself can hallucinate or infer beyond what’s in the activations.

Open-Source Release & Community Dynamics

Strong positive reaction to releasing code and open-weight AV/AR models for several open LLMs; seen as good for the safety and local-model communities.
Some argue this doesn’t count as real open-sourcing because the main commercial model remains closed, and accuse large labs of “leeching” off open models. Others push back, noting that full training code and checkpoints were released.

Broader Implications & Open Questions

Whether architectural similarity between the base model and the AV/AR is crucial remains an open question; no cross-architecture comparisons were reported.
Some propose using edited explanations (e.g., swapping “rabbit” for “mouse”) as causal tests; one such example exists, but it works only about half the time.
Multiple commenters see this as a plausible path toward enforceable interpretability standards, while emphasizing that trustworthiness of “thought” readouts is still unresolved.
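The edited-explanation test mentioned above can be sketched as a small procedure: edit one word in the explanation, reconstruct an activation from the edited text, patch it in, and check whether behavior changes accordingly. Everything below is a hypothetical stand-in (the concept vectors, `toy_reconstruct`, and `toy_readout` are invented for illustration), and in this toy the edit always succeeds, unlike the roughly 50% rate commenters cite for the real example.

```python
import numpy as np

# Hypothetical concept directions standing in for a real reconstructor's output space.
CONCEPTS = {
    "rabbit": np.array([1.0, 0.0, 0.0]),
    "mouse": np.array([0.0, 1.0, 0.0]),
    "rhyme": np.array([0.0, 0.0, 1.0]),
}

def toy_reconstruct(explanation):
    """Toy AR: sum the direction of every known concept word in the text."""
    vec = np.zeros(3)
    for word in explanation.lower().split():
        vec = vec + CONCEPTS.get(word.strip(".,"), np.zeros(3))
    return vec

def toy_readout(activation):
    """Toy downstream behavior: which animal direction dominates the patched activation."""
    animals = ["rabbit", "mouse"]
    scores = [float(activation @ CONCEPTS[a]) for a in animals]
    return animals[int(np.argmax(scores))]

def edit_test(explanation, old, new):
    """Swap one word in the explanation and report the readout before and after."""
    before = toy_readout(toy_reconstruct(explanation))
    after = toy_readout(toy_reconstruct(explanation.replace(old, new)))
    return before, after

before, after = edit_test("planning a rhyme ending in rabbit", "rabbit", "mouse")
```

A test of this shape probes whether the explanation is causally connected to behavior, rather than whether it merely reconstructs well.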
