2024-06-06

Extracting concepts from GPT-4

Overall reception

Many find the work exciting as a move toward “deep” semantic search and interpretable concepts inside GPT-4.
Others see it as incremental and still evidence that LLMs are largely black boxes, citing the article’s own statements about how little is understood.

Relation to prior interpretability work

Commenters repeatedly compare the work to Anthropic’s sparse autoencoder / “Scaling Monosemanticity” research.
Some claim OpenAI is mostly copying; others argue the methods are concurrent and OpenAI introduces meaningful additions (e.g., new activation functions, dead-latent mitigation, new evaluations).
Several note Anthropic’s demos and visualizations feel more polished and “impressive,” while OpenAI emphasizes less-cherry-picked, more random features to avoid interpretability illusions.

What sparse autoencoders enable

Thread-wide explanation: models contain many “features” or “concepts” encoded in internal activations (from punctuation to historical facts to price changes).
SAEs are presented as a way to decompose these tangled activations into sparse, human-interpretable features.
This could allow:
- Inspecting which concepts fire for given prompts.
- Ablating or boosting concepts to study behavior or steer outputs.
- Possibly manipulating specific knowledge or safety-relevant concepts without disrupting others.
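The inspect / ablate / boost workflow described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the weights are hand-picked rather than trained, the top-k activation is just one possible way to enforce sparsity, and none of the names come from OpenAI's released code.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]

# Hypothetical, hand-picked feature directions (one row per feature).
# A real SAE learns these from model activations; these are NOT trained.
DIRECTIONS: Matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 0.5],
]


def encode(activation: Vector, directions: Matrix, k: int) -> Vector:
    """Project onto each feature direction, then keep only the k largest values."""
    pre = [sum(w * x for w, x in zip(row, activation)) for row in directions]
    top = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Features outside the top k are zeroed; negatives are clipped to zero.
    return [max(p, 0.0) if i in top else 0.0 for i, p in enumerate(pre)]


def decode(features: Vector, directions: Matrix) -> Vector:
    """Reconstruct the activation as a weighted sum of feature directions."""
    dim = len(directions[0])
    return [sum(f * row[d] for f, row in zip(features, directions)) for d in range(dim)]


def scale_feature(features: Vector, index: int, factor: float) -> Vector:
    """Return a copy with one feature scaled: factor 0.0 ablates, factor > 1.0 boosts."""
    edited = list(features)
    edited[index] *= factor
    return edited


activation = [2.0, 0.0, 1.0]
features = encode(activation, DIRECTIONS, k=2)                        # inspect
ablated = decode(scale_feature(features, 2, 0.0), DIRECTIONS)         # ablate feature 2
boosted = decode(scale_feature(features, 0, 2.0), DIRECTIONS)         # boost feature 0
```

In this toy setup only features 0 and 2 fire for the example activation. Zeroing feature 2 removes its contribution from the reconstruction while leaving feature 0's untouched, which is the kind of targeted edit the thread has in mind.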

Limits, skepticism, and open questions

Multiple commenters stress this is very early; we still lack a general understanding of how transformers work, why capabilities emerge, or how to fully debug them.
Debate over hallucinations: some think interpretability could eventually help; others argue LLMs are “always hallucinating,” and distinguishing fact vs. fiction internally may be ill-posed.
Some doubt whether we’ll ever get a low-level, brain-like understanding of such complex systems.

Safety, risk, and legal angles

Some hold that better interpretability is crucial for safety (e.g., detecting deception, controlling harmful concepts).
Counterpoint that we already know the system “just outputs tokens,” and real risk lies in how people use those outputs.
Extended debate using analogies (knives, cars, Google Search, social media) about what should be regulated: underlying tech vs. applications.
Questions about training data: the feature viewer draws on The Pile, labeled “uncopyrighted,” which some read as implying that GPT-4’s internal data is copyright-sensitive; others raise potential legal and “fair use” issues.

Potential applications and tooling

Ideas include semantic search based on concept activations, hybrid search with sparse features, caching “hot spots” to speed inference, browser extensions for knowledge workers, and better content filtering.
Open-sourced SAE code (on an older open model) and tokenization tools are noted as practical outputs.
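The semantic-search idea mentioned above could look roughly like the following sketch. It assumes each document has already been reduced to a sparse map from feature index to activation strength; the feature indices, document names, and the choice of cosine similarity are all hypothetical and not taken from the thread or from any released tool.

```python
import math
from typing import Dict, List, Tuple

# Sparse concept vector: feature index -> activation strength.
SparseVec = Dict[int, float]


def cosine(a: SparseVec, b: SparseVec) -> float:
    """Cosine similarity over the features that two sparse vectors share."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: SparseVec, index: Dict[str, SparseVec], top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank documents by how closely their concept activations match the query's."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical index: feature 7 stands in for a "price change" concept,
# feature 3 for a "historical fact" concept.
INDEX: Dict[str, SparseVec] = {
    "doc_prices": {7: 3.0, 12: 1.0},
    "doc_history": {3: 2.0},
    "doc_mixed": {7: 1.0, 3: 1.0},
}

results = search({7: 1.0}, INDEX)
```

Because matching happens on concept features rather than on surface tokens, two documents with no words in common can still rank together, provided the same features fire for both.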
