Chain of Recursive Thoughts: Make AI think harder by making it argue with itself
Multi-agent workflows and tools
- Several commenters are building or using graph/flow UIs (Unreal-style, n8n, Autogen Studio, Fast Agent, llm-consortium) to wire together:
  - “Writer” agents, “harsh critic” agents, and arbiters/judges, run in iterative loops until a pass/score threshold is met (see the sketch after this list).
  - Multi-model “consortia” where different models specialize (research, JSON extraction, drafting) and an arbiter synthesizes the results.
- Interest in “group chat” UIs with multiple LLM personalities, and even multiple providers, as a way to get second opinions or diverse perspectives.
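A minimal sketch of such a writer → critic → arbiter loop is below. The prompts, threshold, and the `llm` callable are placeholders for whatever client and model you actually wire in, not any particular tool from the thread.

```python
from typing import Callable

def draft_critique_loop(task: str, llm: Callable[[str], str],
                        max_rounds: int = 3, pass_score: int = 8) -> str:
    """Writer -> harsh critic -> arbiter, iterated until the score threshold."""
    draft = llm(f"Write a first draft for this task:\n{task}")
    for _ in range(max_rounds):
        critique = llm("Act as a harsh critic. List the concrete flaws in this draft:\n"
                       f"{draft}")
        verdict = llm("Act as an arbiter. Given the critique, score the draft 1-10. "
                      f"Reply with only the number.\nDraft:\n{draft}\nCritique:\n{critique}")
        try:
            if int(verdict.strip()) >= pass_score:
                break  # arbiter says it's good enough; stop iterating
        except ValueError:
            pass  # arbiter didn't return a bare number; keep revising
        draft = llm("Revise the draft so it addresses the critique.\n"
                    f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}")
    return draft
```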
Prompt patterns for self-critique
- Many people already hand-roll simpler versions (a sequential-messages sketch follows this list):
  - Ask for thinking → critique → revised thinking cycles.
  - Force enumeration of flaws (“find the 5 biggest issues”, “put on your critical hat”).
  - Plan → find flaws → update plan in sequential messages.
  - Adversarial roles (research assistant vs. skeptical department head, attorney vs. opponent, councils or “senates” of personas).
- Some use humorous or motivational framing (“you need to leave the meeting to use the bathroom”) to push for concision.
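The plan → find flaws → update plan variant can be hand-rolled as sequential messages in a single chat. In this sketch, `chat` is an assumed stand-in for any chat-completion call that takes a message list and returns the assistant's reply text.

```python
from typing import Callable

Messages = list[dict[str, str]]

def plan_with_self_critique(task: str, chat: Callable[[Messages], str],
                            rounds: int = 2) -> str:
    # Initial plan.
    messages: Messages = [{"role": "user", "content": f"Draft a step-by-step plan for: {task}"}]
    plan = chat(messages)
    messages.append({"role": "assistant", "content": plan})
    for _ in range(rounds):
        # Critique pass: force enumeration of concrete flaws.
        messages.append({"role": "user",
                         "content": "Put on your critical hat: list the 5 biggest flaws in the plan above."})
        flaws = chat(messages)
        messages.append({"role": "assistant", "content": flaws})
        # Revision pass: update the plan to address those flaws.
        messages.append({"role": "user",
                         "content": "Rewrite the plan so it addresses every flaw you just listed."})
        plan = chat(messages)
        messages.append({"role": "assistant", "content": plan})
    return plan
```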
Debate, novelty, and prior work
- Strong agreement that debate / self-argument boosts depth and catches errors, similar to Socratic methods and classic “society of mind” ideas.
- Others argue this is well-trodden: STORM, Tree-of-Thought, inference-time scaling, “LLM as judge”, and numerous NeurIPS/ICML/ICLR papers already cover multi-agent and debate-style reasoning.
- Some describe Monte Carlo or genetic-style approaches (many branches, scoring, then refinement).
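The branch-score-refine idea might look roughly like the following beam/genetic-style loop; the `llm` callable, prompts, and population sizes are assumptions, and the self-scoring step is exactly what the next section is skeptical about.

```python
import random
from typing import Callable

def branch_score_refine(task: str, llm: Callable[[str], str],
                        branches: int = 6, keep: int = 2, generations: int = 3) -> str:
    def score(candidate: str) -> float:
        # Self-scoring via the model; an external verifier would be more reliable.
        reply = llm("Score this answer 0-10. Reply with only a number.\n"
                    f"Task: {task}\nAnswer:\n{candidate}")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    # Seed generation: several independent attempts at the task.
    population = [llm(f"Attempt this task:\n{task}") for _ in range(branches)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[:keep]          # keep the highest-scoring branches
        population = survivors + [             # refine survivors to refill the pool
            llm(f"Improve this answer.\nTask: {task}\nAnswer:\n{random.choice(survivors)}")
            for _ in range(branches - keep)
        ]
    return max(population, key=score)
```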
Limits of LLM-vs-LLM checking
- A substantial subthread is skeptical that LLMs can reliably verify each other:
  - Reports that self-critique can reduce accuracy, while external verifiers (compilers, SAT solvers, tests) often give much bigger wins (see the sketch after this list).
  - For coding, people see models hallucinate flags and APIs, and game tests in simplistic ways unless those tests are reviewed or property-based.
- Consensus: generation is easier than verification for today’s models; LLMs as judges are useful, but not trustworthy as sole verifiers.
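One way to act on that is to let an external verifier, such as a test runner, decide when to stop instead of another model. The sketch below assumes a hypothetical `llm` callable, writes each candidate to a local `candidate.py`, and shells out to pytest purely as an example command; as the safety point in the next section suggests, commands like this should run in a sandbox.

```python
import subprocess
from typing import Callable, Sequence

def generate_until_tests_pass(spec: str, llm: Callable[[str], str],
                              test_cmd: Sequence[str] = ("pytest", "-q"),
                              max_attempts: int = 3) -> str:
    feedback = ""
    code = ""
    for _ in range(max_attempts):
        code = llm(f"Write the Python module described below.\nSpec: {spec}\n{feedback}")
        with open("candidate.py", "w") as f:
            f.write(code)
        # External verification: the test runner, not another LLM, is the judge.
        result = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return code
        feedback = ("The previous attempt failed these tests, fix it:\n"
                    f"{result.stdout}\n{result.stderr}")
    return code  # best effort after max_attempts
```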
Practical concerns and behavior
- Cost and latency: multiple rounds and agents can mean large token usage and slower responses; some question whether this beats “best-of-N” single-step sampling (sketched after this list).
- Energy and infrastructure: concern that endless AI debates will strain power/cooling.
- Safety: running AI-generated commands/tests raises “rm -rf /” worries; sandboxing (Docker, manual approval, firewalled prompts) is recommended.
- Behaviorally, multi-agent chats often converge to agreement or sycophancy unless prompts are very carefully engineered.
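For comparison with the iterative loops above, a best-of-N baseline draws N independent samples in one step and picks one; the sketch below uses an LLM judge for the pick purely as a placeholder, and costs roughly 2N calls rather than the multiplicative cost of multi-round debates.

```python
from typing import Callable

def best_of_n(task: str, llm: Callable[[str], str], n: int = 5) -> str:
    # N independent samples in a single step.
    candidates = [llm(f"Answer this task:\n{task}") for _ in range(n)]

    def judge(answer: str) -> float:
        reply = llm("Rate this answer 0-10. Reply with only a number.\n"
                    f"Task: {task}\nAnswer:\n{answer}")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    return max(candidates, key=judge)  # one scoring pass per candidate
```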
Miscellaneous
- Side debate on numeronym-style names (n8n, k8s, i18n, a11y) as confusing jargon versus acceptable community shorthand.
- Philosophical back-and-forth on whether “thinking” is an appropriate term for token prediction with recursive scaffolding.