Wasting Inferences with Aider

Agent fleets vs single agents

  • Some argue multiple agents/models in parallel won’t fix classes of problems that are fundamentally hard for LLMs (e.g., LeetCode-hard–type reasoning); if one fails, many will too.
  • Others counter that diversity helps: different models, prompts, and contexts can yield genuinely different solutions; a “fleet” doesn’t scale success linearly, but each reasonably independent attempt lowers the chance that all of them fail (see the fan-out sketch after this list).
  • Concern: you may just replace “implement feature once” with “sort through many mediocre PRs,” creating a harder review task.
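
A minimal sketch of what such a fan-out over diverse models and prompts could look like. The model names, prompt variants, and the generate_patch() helper are hypothetical placeholders, not Aider’s or any particular tool’s actual interface.

```python
# Minimal sketch of a "fleet" fan-out: launch several diverse attempts in
# parallel. Model names, prompt styles, and generate_patch() are assumed
# placeholders, not a real tool's API.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["model-a", "model-b", "model-c"]                    # assumed, diverse base models
PROMPT_STYLES = ["terse spec", "spec plus failing test", "spec plus related code"]

def generate_patch(model: str, prompt: str) -> str:
    """Placeholder: call whichever agent/CLI you use and return a candidate diff."""
    raise NotImplementedError

def fan_out(spec: str) -> list[str]:
    # One job per (model, prompt-style) pair; run them concurrently.
    jobs = [(model, f"{style}\n\n{spec}") for model in MODELS for style in PROMPT_STYLES]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(lambda job: generate_patch(*job), jobs))
```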

Verification and code review as the real bottleneck

  • Generating multiple PRs per ticket raises the question: who reviews them all?
  • Suggestions:
    • Use LLMs as judges/supervisors to rank or filter candidate PRs (see the triage sketch after this list).
    • Combine tests + LLM-review + human spot checks.
  • Critics note that agent-generated tests and PRs still need human validation (“who tests the tests?”), so code review quickly becomes the constraint.
  • Strong view: the hard part isn’t generating patches but reproducing bugs, validating fixes, and checking for regressions in realistic environments.
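
A minimal sketch of the “tests first, LLM judge second, human last” triage suggested above; run_test_suite() and llm_judge_score() are hypothetical placeholders, and the judge model is exactly the component critics say still needs human validation.

```python
# Candidate-PR triage: a cheap objective filter first (the test suite), then
# an LLM judge to rank the survivors, with only the top few reaching a human.
# Both helper functions are hypothetical placeholders.

def run_test_suite(patch: str) -> bool:
    """Apply the patch in an isolated checkout and run the project's tests."""
    raise NotImplementedError

def llm_judge_score(patch: str, ticket: str) -> float:
    """Ask a judge model to score the patch against the ticket (0.0 to 1.0)."""
    raise NotImplementedError

def triage(candidates: list[str], ticket: str, top_n: int = 2) -> list[str]:
    passing = [p for p in candidates if run_test_suite(p)]        # objective filter
    ranked = sorted(passing, key=lambda p: llm_judge_score(p, ticket), reverse=True)
    return ranked[:top_n]                                         # only these reach a reviewer
```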

Reliability, randomness, and “wasteful” inference

  • Parallel attempts can exploit probabilistic variation; even a small k (say 3) can meaningfully raise the odds of drawing one “good” sample (worked out after this list).
  • Skeptics respond that any probabilistic scheme still needs an external agent to decide which output is correct, which is the truly expensive part.
  • Some liken “wasting inferences” to adding abductive extensions on top of inductive LLMs, converging toward expert-system–like architectures.
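
The arithmetic behind the small-k argument, assuming independent attempts; that independence is exactly the assumption skeptics attack, since similar models tend to fail in correlated ways.

```latex
% Chance that at least one of k independent attempts succeeds, given a
% single-attempt success rate p. Independence is an assumption: correlated
% failures across similar models weaken this bound.
P(\text{at least one good sample}) = 1 - (1 - p)^{k}
% Illustrative numbers: p = 0.4, k = 3 gives 1 - 0.6^{3} = 0.784.
```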

Autonomous modes and tooling (Aider, Cursor, Claude Code, etc.)

  • Several reports of agents going off the rails: creating branches, running commands, or “fixing” non-problems without being asked (“an automatic lawnmower through the flowerbed”).
  • Aider’s new autonomous/navigator modes are highlighted as promising, but currently expensive and still in need of human intervention.
  • Local models can work with the same tool-calling prompts, but per-model prompt tuning remains fragile (illustrated below).
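
A sketch of what that fragility tends to look like in practice. None of this is Aider’s actual prompt format; the tool schema, model identifiers, and wording tweaks are assumptions chosen to show the shape of the problem.

```python
# Illustrative only, not a real tool's prompts: the same tool-calling
# contract needs slightly different wording per local model, and each
# hand-tuned tweak is easy to break.
TOOL_SCHEMA = {
    "name": "edit_file",
    "parameters": {"path": "string", "search": "string", "replace": "string"},
}

PROMPT_OVERRIDES = {
    # hypothetical model identifiers, each with a hand-tuned instruction
    "local-coder-7b": "Respond ONLY with a JSON tool call matching the schema.",
    "local-chat-8b":  "Wrap the JSON tool call in <tool_call> tags and output nothing else.",
}

def build_system_prompt(model: str) -> str:
    base = f"You may call tools. Schema: {TOOL_SCHEMA}"
    return base + "\n" + PROMPT_OVERRIDES.get(model, "Return a single JSON tool call.")
```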

Context, learning, and limits

  • Repeated theme: tools aren’t the issue; deep project knowledge and context are. Current context windows and attention mechanisms limit what agents can meaningfully ingest.
  • Comparisons to junior devs: humans can (in theory) learn from feedback; LLMs don’t update their weights online, so users must encode “lessons” in prompts and config files (see the sketch after this list).
  • Some see continual/team-level learning models as the “next big breakthrough.”
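
One low-tech way to encode such “lessons”, as a sketch: keep them in a team-maintained notes file and re-inject them into every prompt. The file name and helper below are hypothetical, not a feature of any particular tool.

```python
# Sketch: because the model doesn't learn online, "lessons" have to be
# re-injected on every run. Here they live in a team-maintained text file
# prepended to each task prompt; the file name and helper are hypothetical.
from pathlib import Path

LESSONS_FILE = Path("agent_lessons.md")   # e.g. "always run the linter", "never edit migrations/"

def build_prompt(task: str) -> str:
    lessons = LESSONS_FILE.read_text() if LESSONS_FILE.exists() else ""
    return f"Project conventions and past lessons:\n{lessons}\n\nTask:\n{task}"
```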

Economics and future workflows

  • Token costs for serious autonomous use can be substantial (see the rough arithmetic after this list); “cheap” IDE subscriptions may be underpriced or heavily subsidized.
  • Some foresee pipelines from customer feature requests straight to PRs + ephemeral environments; others call this unsafe until verification and context issues are solved.
  • Minority view: elaborate fleet/agent setups are over-engineering; waiting for better base models may be more efficient.
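
Rough arithmetic behind the token-cost point; every figure below is an assumption chosen for illustration, not a quoted price or measured usage.

```python
# Back-of-the-envelope cost of autonomous agent runs. Every number here is
# an assumption for illustration, not a quoted price or real benchmark.
input_price_per_m  = 3.00        # $ per 1M input tokens (assumed)
output_price_per_m = 15.00       # $ per 1M output tokens (assumed)
tokens_in_per_run  = 1_500_000   # repo context re-sent across many agent steps (assumed)
tokens_out_per_run = 100_000     # generated plans, diffs, tool calls (assumed)
runs_per_workday   = 20          # a modest parallel "fleet" (assumed)

cost_per_run = (tokens_in_per_run / 1e6) * input_price_per_m \
             + (tokens_out_per_run / 1e6) * output_price_per_m
monthly = cost_per_run * runs_per_workday * 22
print(f"per run: ${cost_per_run:.2f}  per month: ${monthly:,.2f}")
# -> per run: $6.00  per month: $2,640.00, far above a $20/month IDE subscription
```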