Extracting concepts from GPT-4

Overall reception

  • Many find the work exciting as a move toward “deep” semantic search and interpretable concepts inside GPT-4.
  • Others see it as incremental and as further evidence that LLMs remain largely black boxes, citing the article’s own statements about how little is understood.

Relation to prior interpretability work

  • Repeated comparisons to Anthropic’s sparse autoencoder / “Scaling Monosemanticity” work.
  • Some claim OpenAI is mostly copying; others argue the methods are concurrent and OpenAI introduces meaningful additions (e.g., new activation functions, dead-latent mitigation, new evaluations).
  • Several note that Anthropic’s demos and visualizations feel more polished and “impressive,” while OpenAI emphasizes less cherry-picked, more random features to avoid interpretability illusions.

What sparse autoencoders enable

  • Thread-wide explanation: models contain many “features” or “concepts” encoded in internal activations (from punctuation to historical facts to price changes).
  • SAEs are presented as a way to decompose these tangled activations into sparse, human-interpretable features.
  • This could allow:
    • Inspecting which concepts fire for given prompts.
    • Ablating or boosting concepts to study behavior or steer outputs.
    • Possibly manipulating specific knowledge or safety-relevant concepts without disrupting others.
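
The inspect / ablate / boost workflow described above can be illustrated with a toy sparse autoencoder. This is a minimal sketch, not the method from the article: the weights are random rather than trained, the dimensions are tiny, and the top-k sparsity rule is an assumption chosen for simplicity. All names (`encode`, `decode`, `ablate`, `boost`, `W_enc`, `W_dec`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 8      # width of a (hypothetical) model activation vector
N_LATENTS = 32   # number of SAE features, usually far larger than D_MODEL
K = 4            # assumed sparsity: at most K features fire per input

# Random weights stand in for a trained autoencoder (assumption).
W_enc = rng.normal(size=(D_MODEL, N_LATENTS))
W_dec = rng.normal(size=(N_LATENTS, D_MODEL))
b_pre = np.zeros(D_MODEL)


def encode(x, k=K):
    """Map an activation vector to a sparse, non-negative feature vector."""
    pre = np.maximum((x - b_pre) @ W_enc, 0.0)  # ReLU pre-activations
    latents = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]                  # indices of the k largest values
    latents[top] = pre[top]                     # everything else stays zero
    return latents


def decode(latents):
    """Reconstruct an activation vector from the feature vector."""
    return latents @ W_dec + b_pre


def ablate(latents, idx):
    """Return a copy with one feature switched off."""
    out = latents.copy()
    out[idx] = 0.0
    return out


def boost(latents, idx, scale):
    """Return a copy with one feature scaled up or down."""
    out = latents.copy()
    out[idx] *= scale
    return out


x = rng.normal(size=D_MODEL)          # stand-in for a real model activation
z = encode(x)
active = np.flatnonzero(z)            # inspect: which features fired
strongest = int(np.argmax(z))
x_ablated = decode(ablate(z, strongest))      # activation with the feature removed
x_boosted = decode(boost(z, strongest, 2.0))  # activation with the feature doubled
```

In a real setting the edited activation would be written back into the model's forward pass to observe how the output changes; that step is omitted here because it depends on the model and tooling in use.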

Limits, skepticism, and open questions

  • Multiple commenters stress this is very early; we still lack a general understanding of how transformers work, why capabilities emerge, or how to fully debug them.
  • Debate over hallucinations: some think interpretability could eventually help; others argue LLMs are “always hallucinating,” and distinguishing fact vs. fiction internally may be ill-posed.
  • Some doubt whether we’ll ever get a low-level, brain-like understanding of such complex systems.

Safety, risk, and legal angles

  • View that better interpretability is crucial for safety (e.g., detecting deception, controlling harmful concepts).
  • Counterpoint that we already know the system “just outputs tokens,” and real risk lies in how people use those outputs.
  • Extended debate using analogies (knives, cars, Google Search, social media) about what should be regulated: underlying tech vs. applications.
  • Questions about training data: the viewer uses The Pile in its “uncopyrighted” form, which some read as implying that GPT-4’s internal data is copyright-sensitive; others raise potential legal and “fair use” issues.

Potential applications and tooling

  • Ideas include semantic search based on concept activations, hybrid search with sparse features, caching “hot spots” to speed inference, browser extensions for knowledge workers, and better content filtering.
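
As a rough illustration of the semantic-search idea, documents and queries can each be represented by the sparse features they activate and ranked by similarity. This is a sketch under assumptions, not something proposed in detail in the thread: the feature IDs, activation values, and document names are made up, and cosine similarity is just one plausible scoring rule.

```python
from typing import Dict, List, Tuple

# A sparse feature vector: feature id -> activation strength (hypothetical values).
SparseVec = Dict[int, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over the features both vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b[k] for k, v in a.items() if k in b)


def norm(a: SparseVec) -> float:
    return sum(v * v for v in a.values()) ** 0.5


def cosine(a: SparseVec, b: SparseVec) -> float:
    denom = norm(a) * norm(b)
    return sparse_dot(a, b) / denom if denom else 0.0


def search(query: SparseVec, index: Dict[str, SparseVec],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity of their feature activations."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy index: in practice these vectors would come from an SAE run over each document.
index = {
    "doc_prices": {3: 2.0, 17: 0.5},
    "doc_history": {8: 1.5, 21: 1.0},
    "doc_mixed": {3: 1.0, 8: 1.0},
}
query = {3: 1.0}  # a query that activates only feature 3
results = search(query, index)
```

Because each vector has only a handful of non-zero entries, such an index could in principle be stored as an inverted list per feature, which is what makes the hybrid-search suggestion plausible.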
  • Open-sourced SAE code (on an older open model) and tokenization tools are noted as practical outputs.