LLMs get lost in multi-turn conversation
Context Poisoning & Multi-Turn Degradation
- Many commenters say the paper matches everyday experience: once a conversation “gets poisoned” by a wrong assumption, bad answer, or off-topic tangent, quality often degrades irreversibly.
- Memory features and cross-chat “personalization” are seen as risky; some disable memory because it propagates mistakes or irrelevant facts into new chats.
- People notice LLMs tend to stick to early interpretations even after being corrected, suggesting a bias toward the first “complete” answer rather than ongoing belief revision.
User Strategies & Interface Ideas
- Common workaround: start new chats frequently, carrying over only a concise summary, spec, or small curated code/context sample.
- Heavy users rely on:
- Editing or deleting previous turns to “unpoison” context.
- Forking/branching conversations from earlier points.
- Manual or automatic compaction/summarization of history.
- Several tools and workflows are mentioned (local UIs, editors, bots) that let users edit history, compact context, or branch chats; many want Git-like branching and bookmarking as first-class UX.
- Some advocate “conversation version control” and treating chats as editable documents rather than immutable logs (a minimal sketch of forking and compaction follows this list).
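To make the forking and compaction workflows concrete, here is a minimal Python sketch. The `Turn`/`Conversation` classes and the `summarize` callback are hypothetical names for illustration only, not any particular tool's API; a real client would call its chat backend inside `summarize`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append(Turn(role, content))

    def fork(self, at_turn: int) -> "Conversation":
        """Branch from an earlier point, dropping the later ("poisoned") turns."""
        return Conversation(turns=list(self.turns[:at_turn]))

    def compact(self, summarize: Callable[[List[Turn]], str]) -> "Conversation":
        """Start a fresh chat that carries over only a concise summary."""
        fresh = Conversation()
        fresh.add("user", "Summary of prior discussion:\n" + summarize(self.turns))
        return fresh


# Usage: fork just before the bad turn, or compact into a brand-new chat.
chat = Conversation()
chat.add("user", "Help me design a rate limiter.")
chat.add("assistant", "A token bucket is a good fit here...")
clean_branch = chat.fork(at_turn=1)      # keep only the first turn
new_chat = chat.compact(lambda turns: "We are designing a token-bucket rate limiter.")
```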
Capabilities, Limits, and Human Comparison
- Some describe long, successful multi-week debugging or protocol-deconstruction sessions, but note that these worked best when:
- The human steered strongly and knew the goal.
- The LLM mostly compressed complex material into clearer explanations.
- Others report mixed results in complex, versioned domains (e.g., cellular protocols, frameworks), with hallucinated features or version-mixing.
- There’s debate over how analogous LLM failures are to human confusion; commenters say LLMs feel different because, once off-track, they rarely recover.
Prompting, Clarification, and “Thinking” Models
- A recurring criticism: LLMs seldom ask for clarification, instead guessing and confidently running with underspecified instructions.
- Some say you can train or prompt them to ask questions or self-check (Socratic or “multiple minds” styles; see the prompt sketch after this list), but others doubt they truly “know when they don’t know” rather than simply asking at arbitrary times.
- Overconfidence and lack of introspection are framed as architectural consequences of autoregressive next-token prediction and training on “happy path” data.
- One thread argues that test-time reasoning / “thinking” models and chain-of-thought might mitigate this, and criticizes the paper for not evaluating them.
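A sketch of the ask-before-guessing prompting style some commenters describe, assuming the common role/content chat-message format; the system-prompt wording and the `call_llm` helper are illustrative assumptions, and whether models follow such instructions reliably is exactly what the thread debates.

```python
from typing import Dict, List

# Assumed wording for a "clarify first" instruction; not a verified recipe.
CLARIFY_FIRST_SYSTEM_PROMPT = (
    "Before answering, check whether the request is fully specified. "
    "If any requirement is ambiguous or missing, ask one concise clarifying "
    "question instead of guessing. Answer only once the requirements are clear."
)


def build_messages(user_request: str) -> List[Dict[str, str]]:
    """Assemble a chat payload in the common role/content message format."""
    return [
        {"role": "system", "content": CLARIFY_FIRST_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]


# reply = call_llm(build_messages("Add caching to the service."))
# (call_llm is a placeholder for whatever chat client is in use.)
```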
Tooling, Agents, and Research Gaps
- Multiple comments propose “curator” or meta-agents that dynamically prune, rewrite, or RAG-ify chat history, as well as richer memory hierarchies (training data vs. context vs. external store); a minimal sketch follows this list.
- Others stress that prompt engineering is really ongoing context management, not just crafting the initial system prompt.
- Some want more empirical guidance on practical context limits for coding and long projects, and ask why certain open models (Qwen, Mistral) were not evaluated.
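To make the curator idea concrete, here is a minimal Python sketch of a per-turn curation pass, assuming role/content message dicts; the pruning policy and the `summarize`/`retrieve` helpers are hypothetical stand-ins for a summarizer model and an external store, not a specific library's API.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def curate_context(
    history: List[Message],
    query: str,
    summarize: Callable[[List[Message]], str],
    retrieve: Callable[[str], List[str]],
    keep_last: int = 6,
) -> List[Message]:
    """Build the working context actually sent to the model for the next turn."""
    older, recent = history[:-keep_last], history[-keep_last:]
    curated: List[Message] = []
    if older:
        # Compress stale turns instead of sending them verbatim.
        curated.append({"role": "system",
                        "content": "Summary of earlier turns: " + summarize(older)})
    # Pull in only the externally stored facts relevant to the current query
    # (the "RAG-ify" step mentioned above).
    for fact in retrieve(query):
        curated.append({"role": "system", "content": "Relevant note: " + fact})
    curated.extend(recent)
    curated.append({"role": "user", "content": query})
    return curated
```

The design point is that curation runs on every turn: the raw log can keep growing, but the model only ever sees a small, freshly assembled working context.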