LLMs get lost in multi-turn conversation

Context Poisoning & Multi-Turn Degradation

  • Many commenters say the paper matches everyday experience: once a conversation “gets poisoned” by a wrong assumption, bad answer, or off-topic tangent, quality often degrades irreversibly.
  • Memory features and cross-chat “personalization” are seen as risky; some disable memory because it propagates mistakes or irrelevant facts into new chats.
  • People notice LLMs tend to stick to early interpretations even after being corrected, suggesting a bias toward the first “complete” answer rather than ongoing belief revision.

User Strategies & Interface Ideas

  • Common workaround: frequently start new chats, carry over only a concise summary, spec, or small curated code/context sample.
  • Heavy users rely on:
    • Editing or deleting previous turns to “unpoison” context.
    • Forking/branching conversations from earlier points.
    • Manual or automatic compaction/summarization of history.
  • Several tools and workflows are mentioned (local UIs, editors, bots) that let users edit history, compact context, or branch chats; many want Git-like branching and bookmarking as first-class UX.
  • Some advocate “conversation version control” and treating chats as editable documents rather than immutable logs (a rough sketch of such branching and compaction follows this list).
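
A minimal sketch of what Git-like forking and compaction could look like over a plain list-of-turns chat; the Turn/Chat types and the summarize callable are hypothetical illustrations, not any particular tool’s API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    role: str                          # "user" or "assistant"
    content: str
    parent: Optional["Turn"] = None    # link to the previous turn, forming a tree

@dataclass
class Chat:
    head: Optional[Turn] = None        # latest turn on this branch

    def append(self, role: str, content: str) -> Turn:
        self.head = Turn(role, content, parent=self.head)
        return self.head

    def fork(self, turn: Turn) -> "Chat":
        # Branch a new conversation off any earlier turn, leaving the
        # "poisoned" continuation behind on the old branch.
        return Chat(head=turn)

    def history(self) -> list[Turn]:
        turns, t = [], self.head
        while t is not None:
            turns.append(t)
            t = t.parent
        return list(reversed(turns))

def compact(chat: Chat, summarize) -> Chat:
    # Start a fresh chat seeded only with a concise summary of the old one.
    # `summarize` is a hypothetical callable, e.g. a single LLM call.
    transcript = "\n".join(f"{t.role}: {t.content}" for t in chat.history())
    fresh = Chat()
    fresh.append("user", f"Context so far (summary): {summarize(transcript)}")
    return fresh
```

Forking from the last known-good turn, or compacting into a fresh chat, mirrors the “unpoisoning” and start-a-new-chat workflows described above.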

Capabilities, Limits, and Human Comparison

  • Some describe long, successful multi-week debugging or protocol-deconstruction sessions, but note that these worked best when:
    • The human steered strongly and knew the goal.
    • The LLM mostly compressed complex material into clearer explanations.
  • Others report mixed results in complex, versioned domains (e.g., cellular protocols, frameworks), with hallucinated features or version-mixing.
  • There’s debate over how analogous LLM failures are to human confusion; users say LLMs feel different because once off-track they rarely recover.

Prompting, Clarification, and “Thinking” Models

  • A recurring criticism: LLMs seldom ask for clarification, instead guessing and confidently running with underspecified instructions.
  • Some say you can train or prompt them to ask questions or self-check (Socratic or “multiple minds” styles), but others doubt models truly “know when they don’t know” rather than simply asking at arbitrary times (one such prompting pattern is sketched after this list).
  • Overconfidence and lack of introspection are framed as architectural consequences of autoregressive next-token prediction and training on “happy path” data.
  • One thread argues that test-time reasoning / “thinking” models and chain-of-thought might mitigate this, and criticizes the paper for not evaluating them.
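
As an illustration of the “prompt them to ask questions” approach, here is a minimal clarify-before-answering loop; call_llm and answer_questions are hypothetical callables (any chat-completion API and any way of collecting the missing details from the user), not a specific library’s interface.

```python
CLARIFY_SYSTEM_PROMPT = (
    "Before answering, list any details you are missing that would change the "
    "answer. If anything essential is missing, reply ONLY with your questions, "
    "prefixed with 'QUESTIONS:'. Otherwise, answer directly."
)

def ask_with_clarification(user_request: str, call_llm, answer_questions) -> str:
    """call_llm(messages) -> str wraps any chat-completion API;
    answer_questions(text) -> str gets the missing details from the human."""
    messages = [
        {"role": "system", "content": CLARIFY_SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]
    reply = call_llm(messages)
    while reply.strip().startswith("QUESTIONS:"):
        # Route the questions back to the human instead of letting the model guess.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": answer_questions(reply)})
        reply = call_llm(messages)
    return reply
```

Whether the model asks at genuinely uncertain moments or at arbitrary ones is exactly the doubt raised above; this pattern only routes underspecified requests back to the human rather than resolving that doubt.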

Tooling, Agents, and Research Gaps

  • Multiple comments propose “curator” or meta-agents that dynamically prune, rewrite, or RAG-ify chat history, as well as richer memory hierarchies (training data vs. context vs. external store); a curator-style pruning pass is sketched after this list.
  • Others stress that prompt engineering is really ongoing context management, not just the initial system prompt.
  • Some want more empirical guidance on practical context limits for coding and long projects, and ask why certain open models (Qwen, Mistral) were not evaluated.
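
A minimal sketch of a curator pass in the spirit of these proposals, run before each model call; count_tokens and summarize are hypothetical helpers (a tokenizer and a one-shot summarization call), and the thresholds are arbitrary.

```python
MAX_CONTEXT_TOKENS = 8000   # arbitrary budget for illustration
KEEP_RECENT_TURNS = 6       # recent messages kept verbatim

def curate(messages, count_tokens, summarize):
    """Run before each model call: keep the system prompt and recent turns
    verbatim, and fold older turns into one summary message once the
    context grows past the budget."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= MAX_CONTEXT_TOKENS:
        return messages

    system, rest = messages[0], messages[1:]   # assumes messages[0] is the system prompt
    old, recent = rest[:-KEEP_RECENT_TURNS], rest[-KEEP_RECENT_TURNS:]
    if not old:
        return messages                        # nothing left to fold away
    digest = summarize("\n".join(f"{m['role']}: {m['content']}" for m in old))
    return [
        system,
        {"role": "user", "content": f"Summary of earlier discussion: {digest}"},
        *recent,
    ]
```

Replacing the single summary message with snippets retrieved from an external store would be the “RAG-ify” variant of the same pass.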