LLMs get lost in multi-turn conversation

Context Poisoning & Multi-Turn Degradation

  • Many commenters say the paper matches everyday experience: once a conversation “gets poisoned” by a wrong assumption, bad answer, or off-topic tangent, quality often degrades irreversibly.
  • Memory features and cross-chat “personalization” are seen as risky; some disable memory because it propagates mistakes or irrelevant facts into new chats.
  • People notice LLMs tend to stick to early interpretations even after being corrected, suggesting a bias toward the first “complete” answer rather than ongoing belief revision.

User Strategies & Interface Ideas

  • Common workaround: frequently start new chats, carry over only a concise summary, spec, or small curated code/context sample.
  • Heavy users rely on:
    • Editing or deleting previous turns to “unpoison” context.
    • Forking/branching conversations from earlier points.
    • Manual or automatic compaction/summarization of history.
  • Several tools and workflows are mentioned (local UIs, editors, bots) that let users edit history, compact context, or branch chats; many want Git-like branching and bookmarking as first-class UX.
  • Some advocate “conversation version control” and treating chats as editable documents rather than immutable logs (a rough sketch of such branching and compaction follows this list).
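
A minimal sketch of what Git-like forking and compaction could look like over a plain list-of-turns chat; the Turn/Chat types and the summarize callable are hypothetical illustrations, not any particular tool’s API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    role: str                          # "user" or "assistant"
    content: str
    parent: Optional["Turn"] = None    # link to the previous turn, forming a tree

@dataclass
class Chat:
    head: Optional[Turn] = None        # latest turn on this branch

    def append(self, role: str, content: str) -> Turn:
        self.head = Turn(role, content, parent=self.head)
        return self.head

    def fork(self, turn: Turn) -> "Chat":
        # Branch a new conversation off any earlier turn, leaving the
        # "poisoned" continuation behind on the old branch.
        return Chat(head=turn)

    def history(self) -> list[Turn]:
        turns, t = [], self.head
        while t is not None:
            turns.append(t)
            t = t.parent
        return list(reversed(turns))

def compact(chat: Chat, summarize) -> Chat:
    # Start a fresh chat seeded only with a concise summary of the old one.
    # `summarize` is a hypothetical callable, e.g. a single LLM call.
    transcript = "\n".join(f"{t.role}: {t.content}" for t in chat.history())
    fresh = Chat()
    fresh.append("user", f"Context so far (summary): {summarize(transcript)}")
    return fresh
```

Forking from the last known-good turn, or compacting into a fresh chat, mirrors the “unpoisoning” and start-a-new-chat workflows described above.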

Capabilities, Limits, and Human Comparison

  • Some describe long, successful multi-week debugging or protocol-deconstruction sessions, but note that these worked best when:
    • The human steered strongly and knew the goal.
    • The LLM mostly compressed complex material into clearer explanations.
  • Others report mixed results in complex, versioned domains (e.g., cellular protocols, frameworks), with hallucinated features or version-mixing.
  • There’s debate over how analogous LLM failures are to human confusion; users say LLMs feel different because once off-track they rarely recover.

Prompting, Clarification, and “Thinking” Models

  • A recurring criticism: LLMs seldom ask for clarification, instead guessing and confidently running with underspecified instructions.
  • Some say you can train or prompt them to ask questions or self-check (Socratic or “multiple minds” styles), but others doubt models truly “know when they don’t know” rather than simply asking at arbitrary times (one such prompting pattern is sketched after this list).
  • Overconfidence and lack of introspection are framed as architectural consequences of autoregressive next-token prediction and training on “happy path” data.
  • One thread argues that test-time reasoning / “thinking” models and chain-of-thought might mitigate this, and criticizes the paper for not evaluating them.
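
As an illustration of the “prompt them to ask questions” approach, here is a minimal clarify-before-answering loop; call_llm and answer_questions are hypothetical callables (any chat-completion API and any way of collecting the missing details from the user), not a specific library’s interface.

```python
CLARIFY_SYSTEM_PROMPT = (
    "Before answering, list any details you are missing that would change the "
    "answer. If anything essential is missing, reply ONLY with your questions, "
    "prefixed with 'QUESTIONS:'. Otherwise, answer directly."
)

def ask_with_clarification(user_request: str, call_llm, answer_questions) -> str:
    """call_llm(messages) -> str wraps any chat-completion API;
    answer_questions(text) -> str gets the missing details from the human."""
    messages = [
        {"role": "system", "content": CLARIFY_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]
    reply = call_llm(messages)
    while reply.strip().startswith("QUESTIONS:"):
        # Route the questions back to the human instead of letting the model guess.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": answer_questions(reply)})
        reply = call_llm(messages)
    return reply
```

Whether the model asks at genuinely uncertain moments or at arbitrary ones is exactly the doubt raised above; this pattern only routes underspecified requests back to the human rather than resolving that doubt.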

Tooling, Agents, and Research Gaps

  • Multiple comments propose “curator” or meta-agents that dynamically prune, rewrite, or RAG-ify chat history, as well as richer memory hierarchies (training data vs. context vs. external store); a curator-style pruning pass is sketched after this list.
  • Others stress that prompt engineering is really ongoing context management, not just the initial system prompt.
  • Some want more empirical guidance on practical context limits for coding and long projects, and ask why certain open models (Qwen, Mistral) were not evaluated.
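
A minimal sketch of a curator pass in the spirit of these proposals, run before each model call; count_tokens and summarize are hypothetical helpers (a tokenizer and a one-shot summarization call), and the thresholds are arbitrary.

```python
MAX_CONTEXT_TOKENS = 8000   # arbitrary budget for illustration
KEEP_RECENT_TURNS = 6       # recent messages kept verbatim

def curate(messages, count_tokens, summarize):
    """Run before each model call: keep the system prompt and recent turns
    verbatim, and fold older turns into one summary message once the
    context grows past the budget."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= MAX_CONTEXT_TOKENS:
        return messages

    system, rest = messages[0], messages[1:]   # assumes messages[0] is the system prompt
    old, recent = rest[:-KEEP_RECENT_TURNS], rest[-KEEP_RECENT_TURNS:]
    if not old:
        return messages                        # nothing left to fold away
    digest = summarize("\n".join(f"{m['role']}: {m['content']}" for m in old))
    return [
        system,
        {"role": "user", "content": f"Summary of earlier discussion: {digest}"},
        *recent,
    ]
```

Replacing the single summary message with snippets retrieved from an external store would be the “RAG-ify” variant of the same pass.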