Chain of Recursive Thoughts: Make AI think harder by making it argue with itself
Multi-agent workflows and tools
- Several commenters are building or using graph/flow UIs (Unreal-style, n8n, Autogen Studio, Fast Agent, llm-consortium) to wire together:
  - “Writer” agents, “harsh critic” agents, and arbiters/judges, run in iterative loops until a pass/score threshold is met (see the sketch after this list).
  - Multi-model “consortia” where different models specialize (research, JSON extraction, drafting) and an arbiter synthesizes the results.
- Interest in “group chat” UIs with multiple LLM personalities, and even multiple providers, as a way to get second opinions or diverse perspectives.
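A minimal sketch of such a writer → critic → arbiter loop is below. The prompts, threshold, and the `llm` callable are placeholders for whatever client and model you actually wire in, not any particular tool from the thread.

```python
from typing import Callable

def draft_critique_loop(task: str, llm: Callable[[str], str],
                        max_rounds: int = 3, pass_score: int = 8) -> str:
    """Writer -> harsh critic -> arbiter, iterated until the score threshold."""
    draft = llm(f"Write a first draft for this task:\n{task}")
    for _ in range(max_rounds):
        critique = llm("Act as a harsh critic. List the concrete flaws in this draft:\n"
                       f"{draft}")
        verdict = llm("Act as an arbiter. Given the critique, score the draft 1-10. "
                      f"Reply with only the number.\nDraft:\n{draft}\nCritique:\n{critique}")
        try:
            if int(verdict.strip()) >= pass_score:
                break  # arbiter says it's good enough; stop iterating
        except ValueError:
            pass  # arbiter didn't return a bare number; keep revising
        draft = llm("Revise the draft so it addresses the critique.\n"
                    f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}")
    return draft
```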
Prompt patterns for self-critique
- Many people already hand-roll simpler versions (a sequential-messages sketch follows this list):
  - Ask for thinking → critique → revised thinking cycles.
  - Force enumeration of flaws (“find the 5 biggest issues”, “put on your critical hat”).
  - Plan → find flaws → update plan in sequential messages.
  - Adversarial roles (research assistant vs. skeptical department head, attorney vs. opponent, councils or “senates” of personas).
- Some use humorous or motivational framing (“you need to leave the meeting to use the bathroom”) to push for concision.
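The plan → find flaws → update plan variant can be hand-rolled as sequential messages in a single chat. In this sketch, `chat` is an assumed stand-in for any chat-completion call that takes a message list and returns the assistant's reply text.

```python
from typing import Callable

Messages = list[dict[str, str]]

def plan_with_self_critique(task: str, chat: Callable[[Messages], str],
                            rounds: int = 2) -> str:
    # Initial plan.
    messages: Messages = [{"role": "user", "content": f"Draft a step-by-step plan for: {task}"}]
    plan = chat(messages)
    messages.append({"role": "assistant", "content": plan})
    for _ in range(rounds):
        # Critique pass: force enumeration of concrete flaws.
        messages.append({"role": "user",
                         "content": "Put on your critical hat: list the 5 biggest flaws in the plan above."})
        flaws = chat(messages)
        messages.append({"role": "assistant", "content": flaws})
        # Revision pass: update the plan to address those flaws.
        messages.append({"role": "user",
                         "content": "Rewrite the plan so it addresses every flaw you just listed."})
        plan = chat(messages)
        messages.append({"role": "assistant", "content": plan})
    return plan
```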
Debate, novelty, and prior work
- Strong agreement that debate / self-argument boosts depth and catches errors, similar to Socratic methods and classic “society of mind” ideas.
- Others argue this is well-trodden: STORM, Tree-of-Thought, inference-time scaling, “LLM as judge”, and numerous NeurIPS/ICML/ICLR papers already cover multi-agent and debate-style reasoning.
- Some describe Monte Carlo or genetic-style approaches (many branches, scoring, then refinement).
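The branch-score-refine idea might look roughly like the following beam/genetic-style loop; the `llm` callable, prompts, and population sizes are assumptions, and the self-scoring step is exactly what the next section is skeptical about.

```python
import random
from typing import Callable

def branch_score_refine(task: str, llm: Callable[[str], str],
                        branches: int = 6, keep: int = 2, generations: int = 3) -> str:
    def score(candidate: str) -> float:
        # Self-scoring via the model; an external verifier would be more reliable.
        reply = llm("Score this answer 0-10. Reply with only a number.\n"
                    f"Task: {task}\nAnswer:\n{candidate}")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    # Seed generation: several independent attempts at the task.
    population = [llm(f"Attempt this task:\n{task}") for _ in range(branches)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[:keep]          # keep the highest-scoring branches
        population = survivors + [             # refine survivors to refill the pool
            llm(f"Improve this answer.\nTask: {task}\nAnswer:\n{random.choice(survivors)}")
            for _ in range(branches - keep)
        ]
    return max(population, key=score)
```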
Limits of LLM-vs-LLM checking
- A substantial subthread is skeptical that LLMs can reliably verify each other:
  - Reports that self-critique can reduce accuracy, while external verifiers (compilers, SAT solvers, tests) often give much bigger wins (see the sketch after this list).
  - For coding, people see models hallucinate flags and APIs, and game tests in simplistic ways unless those tests are reviewed or property-based.
- Consensus: generation is easier than verification for today’s models; LLMs as judges are useful, but not trustworthy as sole verifiers.
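One way to act on that is to let an external verifier, such as a test runner, decide when to stop instead of another model. The sketch below assumes a hypothetical `llm` callable, writes each candidate to a local `candidate.py`, and shells out to pytest purely as an example command; as the safety point in the next section suggests, commands like this should run in a sandbox.

```python
import subprocess
from typing import Callable, Sequence

def generate_until_tests_pass(spec: str, llm: Callable[[str], str],
                              test_cmd: Sequence[str] = ("pytest", "-q"),
                              max_attempts: int = 3) -> str:
    feedback = ""
    code = ""
    for _ in range(max_attempts):
        code = llm(f"Write the Python module described below.\nSpec: {spec}\n{feedback}")
        with open("candidate.py", "w") as f:
            f.write(code)
        # External verification: the test runner, not another LLM, is the judge.
        result = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return code
        feedback = ("The previous attempt failed these tests, fix it:\n"
                    f"{result.stdout}\n{result.stderr}")
    return code  # best effort after max_attempts
```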
Practical concerns and behavior
- Cost and latency: multiple rounds and agents can mean large token usage and slower responses; some question whether this beats “best-of-N” single-step sampling (sketched after this list).
- Energy and infrastructure: concern that endless AI debates will strain power/cooling.
- Safety: running AI-generated commands/tests raises “rm -rf /” worries; sandboxing (Docker, manual approval, firewalled prompts) is recommended.
- Behaviorally, multi-agent chats often converge to agreement or sycophancy unless prompts are very carefully engineered.
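For comparison with the iterative loops above, a best-of-N baseline draws N independent samples in one step and picks one; the sketch below uses an LLM judge for the pick purely as a placeholder, and costs roughly 2N calls rather than the multiplicative cost of multi-round debates.

```python
from typing import Callable

def best_of_n(task: str, llm: Callable[[str], str], n: int = 5) -> str:
    # N independent samples in a single step.
    candidates = [llm(f"Answer this task:\n{task}") for _ in range(n)]

    def judge(answer: str) -> float:
        reply = llm("Rate this answer 0-10. Reply with only a number.\n"
                    f"Task: {task}\nAnswer:\n{answer}")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    return max(candidates, key=judge)  # one scoring pass per candidate
```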
Miscellaneous
- Side debate on numeronym-style names (n8n, k8s, i18n, a11y) as confusing jargon versus acceptable community shorthand.
- Philosophical back-and-forth on whether “thinking” is an appropriate term for token prediction with recursive scaffolding.