Tracing the thoughts of a large language model
How LLMs “plan” vs next-token prediction
- Many commenters challenge the cliché that LLMs “just predict the next token.”
- They note that even strict next-token training on long contexts incentivizes learning long-range structure (sentences, paragraphs, rhyme schemes).
- The paper’s poetry and “astronomer/an” examples are seen as evidence that models sometimes select earlier tokens (e.g., “An”) based on later intended tokens (“astronomer”), i.e., micro-scale planning (a toy probe of this is sketched after this list).
- Some argue this is better described as high‑dimensional feature activations encoding future structure, not literal backtracking or explicit search.
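A concrete way to see the “astronomer/an” point is to read the next-token distribution off a model directly. The sketch below is a toy probe, not the paper’s attribution-graph method: it assumes the small open GPT-2 model via the Hugging Face transformers library and simply compares the probabilities of “ a” and “ an” as the immediate next token in a context that invites a vowel-initial noun.

```python
# Toy probe (not the paper's method): check whether next-token probabilities
# for " a" vs " an" already reflect the noun the model is about to produce.
# Assumes the Hugging Face `transformers` package and the small open GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def article_probs(prompt: str):
    """Return P(' a') and P(' an') as the immediate next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # logits for the next token only
    probs = torch.softmax(logits, dim=-1)
    a_id = tok.encode(" a")[0]                     # " a" and " an" are single GPT-2 tokens
    an_id = tok.encode(" an")[0]
    return probs[a_id].item(), probs[an_id].item()

# A context that strongly suggests a vowel-initial continuation ("astronomer"):
p_a, p_an = article_probs("She spent her whole life studying the stars. She grew up to be")
print(f"P(' a') = {p_a:.4f}, P(' an') = {p_an:.4f}")
```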
Training beyond next-token prediction: SFT, RL, and behavior
- There is an extended debate over how much RL and supervised fine-tuning change model behavior relative to base next-token pretraining.
- One camp claims RL on whole responses is what makes chat models usable and pushes them toward long-horizon planning and reliability.
- Others counter that base models already show planning-like behavior, and that RL mostly calibrates style and safety while reducing low-quality or repetitive outputs.
- Some emphasize that mechanically, all these models still generate one token at a time; the “just next-token” framing is misleading but not entirely wrong.
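Mechanically, the “one token at a time” point is just the loop below: a minimal greedy-decoding sketch (same toy GPT-2 setup as above, an assumption rather than any particular chat model). The training differences argued over here change the weights, not this sampling interface.

```python
# Minimal greedy decoding loop: whatever planning happens internally, the
# interface is still "pick one next token, append, repeat".
# Assumes GPT-2 via Hugging Face `transformers`; chat models differ in training,
# not in this basic mechanics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The moon is", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # distribution over the next token only
    next_id = torch.argmax(logits).view(1, 1)    # greedy choice; sampling would go here
    ids = torch.cat([ids, next_id], dim=1)       # append and feed back in

print(tok.decode(ids[0]))
```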
Interpretability, “thoughts,” and anthropomorphism
- Many are impressed by the attribution graphs and feature-level tracing; they see this as genuine progress in mechanistic interpretability and a needed alternative to treating models as pure black boxes (a toy sketch of what “feature-level” means follows this list).
- Others criticize the framing as “hand-wavy,” marketing-like, or philosophically loaded—especially the repeated talk of “thoughts,” “planning,” and “language of thought.”
- Several insist that using human mental terms (thinking, hallucinating, strategy) obscures the mechanistic, statistical nature of the systems and risks magical thinking.
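As rough intuition for what “feature-level” tracing means, the sketch below is a toy stand-in, not Anthropic’s pipeline: it pictures a residual-stream activation being decomposed against a dictionary of learned feature directions (as in a sparse autoencoder) and asks which features fire. Everything in the snippet, from the dimensions to the random dictionary, is a placeholder for trained components.

```python
# Toy illustration of "feature-level" readout, not Anthropic's actual pipeline.
# Idea: a residual-stream activation is approximated as a sparse combination of
# learned feature directions (a dictionary, as in a sparse autoencoder); the
# interpretability work then asks which features are active and how they connect.
# All values below (dimensions, random dictionary, ReLU encoder) are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512            # residual width, dictionary size (made up)

W_enc = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)  # encoder
b_enc = -0.5 * np.ones(n_features)       # negative bias -> sparse activations

def feature_activations(resid: np.ndarray) -> np.ndarray:
    """ReLU(W_enc @ resid + b_enc): which dictionary features fire, and how strongly."""
    return np.maximum(W_enc @ resid + b_enc, 0.0)

resid = rng.standard_normal(d_model)     # stand-in for one token's residual stream
acts = feature_activations(resid)
top = np.argsort(acts)[::-1][:5]
print("top features:", [(int(i), round(float(acts[i]), 3)) for i in top if acts[i] > 0])
```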
Hallucinations / confabulation
- The refusal circuit, described as “on by default” and inhibited by “known entity” features, is widely discussed.
- Commenters connect this to misfires where recognition of a name suppresses “I don’t know” and triggers confident fabrication (a toy caricature of this logic follows this list).
- Some argue “hallucination” is a poor scientific term, proposing “confabulation” or standard error terminology (false positives/negatives), especially for RAG use cases.
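The described mechanism is easy to caricature in a few lines. The sketch below is purely illustrative, with made-up scores rather than measured features, but it captures the failure mode commenters describe: recognition without knowledge suppresses refusal and yields a confident fabrication.

```python
# Caricature of the described circuit (toy numbers, not measured features):
# a default-on refusal signal is inhibited by a "known entity" feature. If name
# recognition fires without actual knowledge behind it, refusal is suppressed
# and the model answers confidently anyway; this is the confabulation failure mode.
def respond(known_entity_score: float, has_facts: bool, threshold: float = 0.5) -> str:
    refusal = 1.0 - known_entity_score        # refusal is "on by default"
    if refusal > threshold:
        return "I don't know."
    return "confident answer" if has_facts else "confident fabrication"

print(respond(known_entity_score=0.1, has_facts=False))  # obscure name -> "I don't know."
print(respond(known_entity_score=0.9, has_facts=True))   # well-known entity -> grounded answer
print(respond(known_entity_score=0.9, has_facts=False))  # recognized name, no facts -> misfire
```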
Generality, multilingual representations, and “biology”
- The finding that larger models share more features across languages supports the view that they build language‑agnostic conceptual representations.
- Language-independent features feel intuitive to multilingual commenters, some of whom liken this to an internal “semantic space” in which individual languages act as coordinate systems (a toy version of that analogy is sketched after this list).
- Others liken this work to systems biology or neuroscience: mapping circuits, inhibition, and motifs in a grown artifact we didn’t explicitly design.
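The “coordinate systems” analogy can be made concrete with a toy linear model. The sketch below is not derived from the paper’s measurements: it uses random placeholder maps to show one shared concept vector being rendered into, and recovered from, two language-specific “surface” spaces.

```python
# Toy linear analogy for "one semantic space, languages as coordinate systems".
# Nothing here is learned from data: the concept vector and per-language maps
# are random placeholders, just to make the geometry of the claim concrete.
import numpy as np

rng = np.random.default_rng(1)
d_concept, d_surface = 16, 32

concept_small = rng.standard_normal(d_concept)      # one language-agnostic concept ("small")

# Per-language "coordinate systems": linear maps from the shared space
# into language-specific surface representations.
to_english = rng.standard_normal((d_surface, d_concept))
to_french  = rng.standard_normal((d_surface, d_concept))

english_repr = to_english @ concept_small            # ~ "small"
french_repr  = to_french  @ concept_small            # ~ "petit"

# Pseudo-inverses recover the shared concept from either language's surface form,
# which is the intuition behind shared features growing with model scale.
back_from_en = np.linalg.pinv(to_english) @ english_repr
back_from_fr = np.linalg.pinv(to_french)  @ french_repr
print(np.allclose(back_from_en, back_from_fr))        # True: same underlying point
```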
Scientific rigor, openness, and limits
- Some question how much of the observed behavior is Claude‑specific and call for replications on open models (Llama, DeepSeek, etc.).
- There is skepticism about selective examples, the lack of broad quantitative tests, and the proprietary nature of Claude; a few dismiss such write-ups as “pseudoacademic infomercials.”
- Others respond that even if imperfect, these methods and visual tools are valuable starting points for a new science of understanding large learned systems.