Tracing the thoughts of a large language model

How LLMs “plan” vs next-token prediction

  • Many commenters challenge the cliché that LLMs “just predict the next token.”
  • They note that even strict next-token training on long contexts incentivizes learning long-range structure (sentences, paragraphs, rhyme schemes).
  • The paper’s poetry and “astronomer/an” examples are seen as evidence that models sometimes select earlier tokens (e.g., “An”) based on later intended tokens (“astronomer”), i.e., micro‑scale planning; a simple behavioral probe of this is sketched after this list.
  • Some argue this is better described as high‑dimensional feature activations encoding future structure, not literal backtracking or explicit search.
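
One way to make the micro‑scale planning claim concrete at the behavioral level is to check which article a model prefers at a position whose natural continuation is a vowel-initial word. The sketch below is a minimal probe of output probabilities, not the paper’s attribution-graph method; the choice of GPT-2 and the prompt are illustrative assumptions.

```python
# Minimal behavioral probe: does the article choice already reflect the
# upcoming noun? (Assumes the transformers package; GPT-2 and the prompt
# are illustrative choices, not taken from the paper.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# If the model is "planning" a word like "astronomer", the probability of
# " an" at this position should rise relative to " a".
prompt = "She spent every night charting the stars, so she grew up to be"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_logits = model(**inputs).logits[0, -1]

probs = torch.softmax(next_logits, dim=-1)
for word in [" a", " an"]:
    token_id = tok.encode(word)[0]
    print(f"P({word!r} | prompt) = {probs[token_id].item():.4f}")
```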

Training beyond next-token: SFT, RL, and behavior

  • There is an extended debate over how much RL and supervised fine-tuning change model behavior relative to base next-token pretraining.
  • One camp claims that RL on whole responses is what makes chat models usable and pushes them toward long-horizon planning and reliability; the two objectives are contrasted in the sketch after this list.
  • Others counter that base models already show planning-like behavior, and that RL mostly calibrates style and safety while reducing low‑quality or repetitive outputs.
  • Some emphasize that mechanically, all these models still generate one token at a time; the “just next-token” framing is misleading but not entirely wrong.
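
As a minimal sketch of the two objectives being debated (illustrative code only, not any lab’s actual training recipe): pretraining and SFT score each next token with cross-entropy, while RL on whole responses spreads a single sequence-level reward over every sampled token.

```python
# Contrast of per-token and whole-response training signals; shapes and the
# REINFORCE-style estimator are deliberate simplifications.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Pretraining / SFT: cross-entropy of each position against the actual next token."""
    # logits: (seq_len, vocab_size), targets: (seq_len,)
    return F.cross_entropy(logits, targets)

def whole_response_rl_loss(token_logprobs: torch.Tensor,
                           reward: float,
                           baseline: float = 0.0) -> torch.Tensor:
    """RL on whole responses: one scalar reward credits every sampled token,
    so the gradient favors sequences that score well end to end."""
    # token_logprobs: (seq_len,) log-probs of the tokens the model actually sampled
    return -(reward - baseline) * token_logprobs.sum()

# Toy shapes only, to show how the interfaces differ.
logits = torch.randn(6, 50)                       # 6 positions, vocab of 50
targets = torch.randint(0, 50, (6,))
sampled_logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(6), targets]
print(next_token_loss(logits, targets).item())
print(whole_response_rl_loss(sampled_logprobs, reward=1.0).item())
```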

Interpretability, “thoughts,” and anthropomorphism

  • Many are impressed by the attribution graphs and feature-level tracing; they see this as genuine progress in mechanistic interpretability and a needed alternative to treating models as pure black boxes (a much-simplified attribution sketch follows this list).
  • Others criticize the framing as “hand-wavy,” marketing-like, or philosophically loaded—especially the repeated talk of “thoughts,” “planning,” and “language of thought.”
  • Several insist that using human mental terms (thinking, hallucinating, strategy) obscures the mechanistic, statistical nature of the systems and risks magical thinking.
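
For readers who want something concrete, the sketch below is a much-simplified stand-in for feature-level tracing: it scores hidden dimensions by gradient times activation against a single output logit. GPT-2, the prompt, and the layer index are arbitrary assumptions; the paper’s attribution graphs rely on learned replacement features and are considerably more involved.

```python
# Crude gradient-x-activation attribution for one output logit (assumes the
# transformers package; model, prompt, and layer index are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

target_id = tok.encode(" Paris")[0]
logit = outputs.logits[0, -1, target_id]       # logit of the completion we care about

layer = 6                                      # arbitrary middle layer
hidden = outputs.hidden_states[layer]          # (1, seq_len, d_model), still in the graph
grads = torch.autograd.grad(logit, hidden)[0]  # d(logit) / d(hidden)

# Per-dimension contribution estimate at the final position.
attribution = (grads * hidden)[0, -1]
top = attribution.abs().topk(5)
print("top hidden dims:", top.indices.tolist())
print("scores:", attribution[top.indices].tolist())
```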

Hallucinations / confabulation

  • The refusal circuit, described as “on by default” and inhibited by “known entity” features, is widely discussed.
  • Commenters connect this to misfires where recognition of a name suppresses “I don’t know” and triggers confident fabrication; see the toy illustration after this list.
  • Some argue “hallucination” is a poor scientific term, proposing “confabulation” or standard error terminology (false positives/negatives), especially for RAG use cases.
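
A toy numerical illustration of the mechanism commenters describe (every name, number, and threshold below is invented; this is not Claude’s actual circuitry):

```python
# Toy model of "refusal on by default, inhibited by a known-entity feature".
# All values are made up for illustration only.

def answer_or_refuse(known_entity_activation: float,
                     refusal_bias: float = 1.0,
                     threshold: float = 0.5) -> str:
    # The default drive to refuse is suppressed in proportion to how strongly
    # the name is recognized.
    refusal_drive = refusal_bias - known_entity_activation
    return "I don't know." if refusal_drive > threshold else "<confident answer>"

print(answer_or_refuse(known_entity_activation=0.1))  # obscure name -> refuses
print(answer_or_refuse(known_entity_activation=0.9))  # familiar name -> answers

# The misfire discussed in the thread: a name is recognized (high activation)
# even though the underlying facts are missing, so refusal is suppressed and
# the model confabulates a confident but false answer.
```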

Generality, multilingual representations, and “biology”

  • The finding that larger models share more features across languages supports the view that they build language‑agnostic conceptual representations.
  • Multilingual, language-independent features feel intuitive to multilingual humans, and some compare this to an internal “semantic space” with languages as coordinate systems (probed at the embedding level in the sketch after this list).
  • Others liken this work to systems biology or neuroscience: mapping circuits, inhibition, and motifs in a grown artifact we didn’t explicitly design.
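
A quick way to build intuition for the shared semantic-space claim at the embedding level (this probes sentence embeddings rather than the per-feature analysis in the paper; the sentence-transformers package and model name are assumptions):

```python
# Sketch: translations of one sentence should cluster, an unrelated sentence
# should not (assumes sentence-transformers and a multilingual model).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The opposite of small is big.",     # English
    "Le contraire de petit est grand.",  # French
    "小的反义词是大。",                    # Chinese
    "The cat sat on the mat.",           # unrelated control
]
embeddings = model.encode(sentences, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)  # the three translations should be mutually closer than the control
```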

Scientific rigor, openness, and limits

  • Some question how much of the observed behavior is Claude‑specific and call for replications on open models (Llama, DeepSeek, etc.).
  • There is skepticism about selective examples, lack of broad quantitative tests, and the proprietary nature of Claude; a few label the work “pseudoacademic infomercials.”
  • Others respond that even if imperfect, these methods and visual tools are valuable starting points for a new science of understanding large learned systems.