Wasting Inferences with Aider

Agent fleets vs single agents

  • Some argue multiple agents/models in parallel won’t fix classes of problems that are fundamentally hard for LLMs (e.g., LeetCode-hard–type reasoning); if one fails, many will too.
  • Others counter that diversity helps: different models, prompts, and contexts can yield genuinely different solutions; a “fleet” doesn’t scale success linearly, but each reasonably independent attempt lowers the chance that all of them fail (see the fan-out sketch after this list).
  • Concern: you may just replace “implement feature once” with “sort through many mediocre PRs,” creating a harder review task.
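
A minimal sketch of what such a fan-out over diverse models and prompts could look like. The model names, prompt variants, and the generate_patch() helper are hypothetical placeholders, not Aider’s or any particular tool’s actual interface.

```python
# Minimal sketch of a "fleet" fan-out: launch several diverse attempts in
# parallel. Model names, prompt styles, and generate_patch() are assumed
# placeholders, not a real tool's API.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["model-a", "model-b", "model-c"]                    # assumed, diverse base models
PROMPT_STYLES = ["terse spec", "spec plus failing test", "spec plus related code"]

def generate_patch(model: str, prompt: str) -> str:
    """Placeholder: call whichever agent/CLI you use and return a candidate diff."""
    raise NotImplementedError

def fan_out(spec: str) -> list[str]:
    # One job per (model, prompt-style) pair; run them concurrently.
    jobs = [(model, f"{style}\n\n{spec}") for model in MODELS for style in PROMPT_STYLES]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(lambda job: generate_patch(*job), jobs))
```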

Verification and code review as the real bottleneck

  • Generating multiple PRs per ticket raises the question: who reviews them all?
  • Suggestions:
    • Use LLMs as judges/supervisors to rank or filter candidate PRs (see the triage sketch after this list).
    • Combine tests + LLM-review + human spot checks.
  • Critics note that agent-generated tests and PRs still need human validation (“who tests the tests?”), so code review quickly becomes the constraint.
  • Strong view: the hard part isn’t generating patches but reproducing bugs, validating fixes, and checking for regressions in realistic environments.
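
A minimal sketch of the “tests first, LLM judge second, human last” triage suggested above; run_test_suite() and llm_judge_score() are hypothetical placeholders, and the judge model is exactly the component critics say still needs human validation.

```python
# Candidate-PR triage: a cheap objective filter first (the test suite), then
# an LLM judge to rank the survivors, with only the top few reaching a human.
# Both helper functions are hypothetical placeholders.

def run_test_suite(patch: str) -> bool:
    """Apply the patch in an isolated checkout and run the project's tests."""
    raise NotImplementedError

def llm_judge_score(patch: str, ticket: str) -> float:
    """Ask a judge model to score the patch against the ticket (0.0 to 1.0)."""
    raise NotImplementedError

def triage(candidates: list[str], ticket: str, top_n: int = 2) -> list[str]:
    passing = [p for p in candidates if run_test_suite(p)]        # objective filter
    ranked = sorted(passing, key=lambda p: llm_judge_score(p, ticket), reverse=True)
    return ranked[:top_n]                                         # only these reach a reviewer
```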

Reliability, randomness, and “wasteful” inference

  • Parallel attempts can exploit probabilistic variation; even a small k (say 3) can meaningfully raise the odds of drawing one “good” sample (worked out after this list).
  • Skeptics respond that any probabilistic scheme still needs an external agent to decide which output is correct, which is the truly expensive part.
  • Some liken “wasting inferences” to adding abductive extensions on top of inductive LLMs, converging toward expert-system–like architectures.
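
The arithmetic behind the small-k argument, assuming independent attempts; that independence is exactly the assumption skeptics attack, since similar models tend to fail in correlated ways.

```latex
% Chance that at least one of k independent attempts succeeds, given a
% single-attempt success rate p. Independence is an assumption: correlated
% failures across similar models weaken this bound.
P(\text{at least one good sample}) = 1 - (1 - p)^{k}
% Illustrative numbers: p = 0.4, k = 3 gives 1 - 0.6^{3} = 0.784.
```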

Autonomous modes and tooling (Aider, Cursor, Claude Code, etc.)

  • Several reports of agents going off the rails: creating branches, running commands, or “fixing” non-problems without being asked (“an automatic lawnmower through the flowerbed”).
  • Aider’s new autonomous/navigator modes are highlighted as promising, but currently expensive and still in need of human intervention.
  • Local models can work with the same tool-calling prompts, but per-model prompt tuning remains fragile (illustrated below).
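
A sketch of what that fragility tends to look like in practice. None of this is Aider’s actual prompt format; the tool schema, model identifiers, and wording tweaks are assumptions chosen to show the shape of the problem.

```python
# Illustrative only, not a real tool's prompts: the same tool-calling
# contract needs slightly different wording per local model, and each
# hand-tuned tweak is easy to break.
TOOL_SCHEMA = {
    "name": "edit_file",
    "parameters": {"path": "string", "search": "string", "replace": "string"},
}

PROMPT_OVERRIDES = {
    # hypothetical model identifiers, each with a hand-tuned instruction
    "local-coder-7b": "Respond ONLY with a JSON tool call matching the schema.",
    "local-chat-8b":  "Wrap the JSON tool call in <tool_call> tags and output nothing else.",
}

def build_system_prompt(model: str) -> str:
    base = f"You may call tools. Schema: {TOOL_SCHEMA}"
    return base + "\n" + PROMPT_OVERRIDES.get(model, "Return a single JSON tool call.")
```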

Context, learning, and limits

  • Repeated theme: tools aren’t the issue; deep project knowledge and context are. Current context windows and attention mechanisms limit what agents can meaningfully ingest.
  • Comparisons to junior devs: humans can (in theory) learn from feedback; LLMs don’t update their weights online, so users must encode “lessons” in prompts and config files (see the sketch after this list).
  • Some see continual/team-level learning models as the “next big breakthrough.”
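
One low-tech way to encode such “lessons”, as a sketch: keep them in a team-maintained notes file and re-inject them into every prompt. The file name and helper below are hypothetical, not a feature of any particular tool.

```python
# Sketch: because the model doesn't learn online, "lessons" have to be
# re-injected on every run. Here they live in a team-maintained text file
# prepended to each task prompt; the file name and helper are hypothetical.
from pathlib import Path

LESSONS_FILE = Path("agent_lessons.md")   # e.g. "always run the linter", "never edit migrations/"

def build_prompt(task: str) -> str:
    lessons = LESSONS_FILE.read_text() if LESSONS_FILE.exists() else ""
    return f"Project conventions and past lessons:\n{lessons}\n\nTask:\n{task}"
```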

Economics and future workflows

  • Token costs for serious autonomous use can be substantial (see the rough arithmetic after this list); “cheap” IDE subscriptions may be underpriced or heavily subsidized.
  • Some foresee pipelines from customer feature requests straight to PRs + ephemeral environments; others call this unsafe until verification and context issues are solved.
  • Minority view: elaborate fleet/agent setups are over-engineering; waiting for better base models may be more efficient.
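
Rough arithmetic behind the token-cost point; every figure below is an assumption chosen for illustration, not a quoted price or measured usage.

```python
# Back-of-the-envelope cost of autonomous agent runs. Every number here is
# an assumption for illustration, not a quoted price or real benchmark.
input_price_per_m  = 3.00        # $ per 1M input tokens (assumed)
output_price_per_m = 15.00       # $ per 1M output tokens (assumed)
tokens_in_per_run  = 1_500_000   # repo context re-sent across many agent steps (assumed)
tokens_out_per_run = 100_000     # generated plans, diffs, tool calls (assumed)
runs_per_workday   = 20          # a modest parallel "fleet" (assumed)

cost_per_run = (tokens_in_per_run / 1e6) * input_price_per_m \
             + (tokens_out_per_run / 1e6) * output_price_per_m
monthly = cost_per_run * runs_per_workday * 22
print(f"per run: ${cost_per_run:.2f}  per month: ${monthly:,.2f}")
# -> per run: $6.00  per month: $2,640.00, far above a $20/month IDE subscription
```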