Natural Language Autoencoders: Turning Claude's Thoughts into Text

Overview of NLAs in the Discussion

  • NLAs translate intermediate activations of an LLM into natural-language “explanations,” then reconstruct activations from those texts.
  • Many commenters see this as a major interpretability step, especially for surfacing hidden motivations, test-detection, or planning structure.
  • Others emphasize it’s more of a powerful debugging/auditing hack than a general-purpose capability.

Methodology & Technical Points

  • Setup: a frozen base model; an Activation Verbalizer (AV) that outputs text from a chosen layer’s activations; an Activation Reconstructor (AR) that maps text back to those activations.
  • AV and AR are warm-started via supervised fine-tuning on summarization-like data, then optimized with a reconstruction loss; a KL term keeps the AV’s outputs close to fluent language.
  • Typically only a single layer is probed; some commenters suggest probing multiple layers to track “evolving thoughts.”
  • Clarification: “activations” refers to a specific layer’s residual stream, not all layers at once.
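
The training objective described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the use of mean squared error for reconstruction, the weight `beta`, the per-token form of the KL term, and all function names are assumptions made for the example. How gradients pass through the discrete text is not covered in the discussion and is not modeled here.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length activation vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def nla_loss(activation, reconstruction, av_token_probs, ref_token_probs, beta=0.1):
    """Reconstruction error plus a KL penalty (hypothetical form).

    activation / reconstruction: one layer's residual-stream vector and the AR's output.
    av_token_probs / ref_token_probs: per-token distributions from the AV and from
    a reference language model; the KL term penalizes drifting away from fluent text.
    """
    recon = mse(activation, reconstruction)
    kl = sum(kl_divergence(p, q) for p, q in zip(av_token_probs, ref_token_probs))
    kl /= len(av_token_probs)  # average over generated tokens
    return recon + beta * kl

# Toy numbers, chosen only to exercise the function.
loss = nla_loss(
    activation=[0.2, -1.0, 0.5],
    reconstruction=[0.1, -0.8, 0.5],
    av_token_probs=[[0.7, 0.3], [0.4, 0.6]],
    ref_token_probs=[[0.6, 0.4], [0.5, 0.5]],
)
print(f"toy loss: {loss:.4f}")
```

Raising `beta` in this sketch trades reconstruction accuracy for text that stays closer to the reference model’s distribution, which is the tension the KL term is meant to manage.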

Faithfulness, Steganography, and Goodhart

  • Core worry: nothing in the loss forces explanations to match the model’s true “reasons”; any invertible code, even gibberish or “Opposite Day” semantics, would reconstruct the activations equally well.
  • Several commenters point to evidence in the paper: paraphrasing, shuffling, or translating explanations harms reconstruction only slightly, suggesting limited steganography.
  • Others counter that AV/AR might still encode only broad latents (“this is adversarial”) rather than the exact causal chain.
  • A further concern: if NLAs are ever used as a training signal, models could learn to game them (Goodhart’s law).
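
The worry and the paraphrase evidence above can be made concrete with a toy example. Everything here is hypothetical: the “activation” is a single bit, the three verbalizer/reconstructor pairs are hand-written, and `paraphrase` is a crude stand-in for a real paraphraser. The sketch only shows that round-trip reconstruction alone cannot distinguish an honest code from an inverted or hidden one, and that a paraphrase test breaks the hidden-channel code.

```python
def paraphrase(text):
    """Crude paraphraser stand-in: normalize case and whitespace, swap synonyms."""
    synonyms = {"adversarial": "hostile", "benign": "harmless", "looks": "seems"}
    return " ".join(synonyms.get(w, w) for w in text.lower().split())

# Honest code: the words mean what they say.
def av_semantic(bit):
    return "Prompt looks adversarial" if bit else "Prompt looks benign"

def ar_semantic(text):
    t = text.lower()
    return int("adversarial" in t or "hostile" in t)

# "Opposite Day" code: the text says the reverse, and the AR undoes it.
def av_opposite(bit):
    return av_semantic(1 - bit)

def ar_opposite(text):
    return 1 - ar_semantic(text)

# Steganographic code: the bit hides in a double space, not in the words.
def av_stego(bit):
    return "Prompt looks fine" + ("  " if bit else " ") + "overall"

def ar_stego(text):
    return int("  " in text)

def round_trip_accuracy(av, ar, transform=lambda t: t):
    """Fraction of bits recovered after verbalize -> transform -> reconstruct."""
    bits = [0, 1]
    return sum(ar(transform(av(b))) == b for b in bits) / len(bits)

for name, av, ar in [("semantic", av_semantic, ar_semantic),
                     ("opposite", av_opposite, ar_opposite),
                     ("stego", av_stego, ar_stego)]:
    print(name, round_trip_accuracy(av, ar), round_trip_accuracy(av, ar, paraphrase))
```

In this toy, all three codes reconstruct perfectly without the transform, and only the steganographic one degrades under paraphrasing. The inverted code also survives paraphrasing here, so within this toy the paraphrase check speaks to hidden channels rather than to whether the words carry their ordinary meaning.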

Empirical Performance & Limitations

  • Reported success in uncovering hidden motivations is modest (a success rate in the low teens, as a percentage): better than prior methods but far from comprehensive.
  • AV outputs can be janky or rambling, especially on some open models (e.g., certain Llama/Gemma checkpoints).
  • The NLA itself can hallucinate or infer beyond what’s in the activations.

Open-Source Release & Community Dynamics

  • Strong positive reaction to releasing code and open-weight AV/AR models for several open LLMs; seen as good for the safety and local-model communities.
  • Some argue this doesn’t count as real open-sourcing because the main commercial model remains closed, and accuse large labs of “leeching” on open models. Others push back, noting full training code and checkpoints were released.

Broader Implications & Open Questions

  • Whether architectural similarity between the base model and the AV/AR is crucial remains an open question; no cross-architecture comparisons were reported.
  • Some commenters propose using edited explanations (e.g., swapping “rabbit” for “mouse”) as causal tests; one such example exists, but it works only about half the time.
  • Multiple commenters see this as a plausible path toward enforceable interpretability standards, while emphasizing that trustworthiness of “thought” readouts is still unresolved.
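
The edited-explanation test mentioned above (swapping “rabbit” for “mouse”) could be organized as a small harness like the one below. This is a sketch under stated assumptions: `verbalize`, `reconstruct`, and `run_with_activation` are hypothetical callables standing in for the AV, the AR, and a patched forward pass of the base model, and the toy stubs are causal by construction, so their success rate says nothing about real NLAs.

```python
import re

def edit_explanation(explanation, old, new):
    """Replace whole-word occurrences of `old` with `new`."""
    return re.sub(rf"\b{re.escape(old)}\b", lambda _m: new, explanation)

def causal_edit_success_rate(cases, verbalize, reconstruct, run_with_activation, old, new):
    """Fraction of applicable cases where the edited concept shows up in the output.

    Cases whose explanation never mentions `old` are skipped, since the edit
    changes nothing there. Returns None if no case was applicable.
    """
    tested = hits = 0
    for activation in cases:
        explanation = verbalize(activation)
        edited = edit_explanation(explanation, old, new)
        if edited == explanation:
            continue
        tested += 1
        output = run_with_activation(reconstruct(edited))
        hits += new in output
    return hits / tested if tested else None

# Toy stubs (hypothetical): the "activation" is a dict holding a planned word.
def toy_verbalize(act):
    return f"planning a rhyme ending in {act['word']}"

def toy_reconstruct(text):
    return {"word": text.split()[-1]}

def toy_run(act):
    return f"...and chased the {act['word']}"

rate = causal_edit_success_rate(
    [{"word": "rabbit"}, {"word": "habit"}],
    toy_verbalize, toy_reconstruct, toy_run, old="rabbit", new="mouse",
)
print("toy success rate:", rate)
```

A harness of this shape reports a rate over many cases rather than a single anecdote, which is the kind of number the “about half the time” remark in the thread refers to.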