The unreasonable effectiveness of an LLM agent loop with tool use

Model quality and tool-calling behavior

  • Many reports of inconsistent coding quality across models:

    • Claude Sonnet 3.7 seen as powerful but prone to weird detours, test-skipping, and “just catch the exception”‑style fixes.
    • GPT‑4o/4.1 often break code, truncate files, or refuse to apply edits directly; 4o especially criticized for coding.
    • o1/o3 “reasoning” models described as uniquely good at handling ~1,000 LOC full‑file edits, but expensive and/or rate‑limited.
    • Gemini 2.5 Pro praised for intelligence and tool-calling, but some find it reluctant or clumsy with tools in certain UIs.
    • Mistral Medium 3 and some local/Qwen models seen as surprisingly strong for cost, especially via OpenRouter/Ollama.
  • Tool use itself is uneven: some models hallucinate diff formats, misuse deprecated packages despite “knowing” they’re deprecated, or claim to have called tools when they haven’t. Others, when wired to compilers/tests/shell, self‑correct effectively in loops.
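
That self-correction pattern is simple enough to sketch. Below is a minimal, illustrative Python loop, assuming a repo with a pytest suite and a hypothetical `ask_model` callable that returns a unified diff; the prompts, helper names, and retry budget are assumptions of this sketch, not any specific commenter's setup.

```python
import subprocess

MAX_ATTEMPTS = 5

def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def apply_patch(repo_dir: str, diff: str) -> bool:
    """Apply a unified diff via git; False if it does not apply cleanly."""
    proc = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        cwd=repo_dir, input=diff, capture_output=True, text=True,
    )
    return proc.returncode == 0

def fix_until_green(ask_model, task: str, repo_dir: str) -> bool:
    """Feed failing test output back to the model until the suite passes.

    `ask_model(prompt) -> str` is a placeholder for any chat-completion call
    that returns a unified diff; it is an assumption of this sketch.
    """
    passed, output = run_tests(repo_dir)
    for _ in range(MAX_ATTEMPTS):
        if passed:
            return True
        prompt = (
            f"Task: {task}\n\nThe test suite is failing:\n{output[-4000:]}\n"
            "Reply with only a unified diff that fixes the failure."
        )
        diff = ask_model(prompt)
        if not apply_patch(repo_dir, diff):
            output = "Your previous diff did not apply cleanly; resend a valid diff."
            continue
        passed, output = run_tests(repo_dir)
    return passed
```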

Workflows, agents, and context management

  • Strong consensus that raw chat UI is the wrong interface for serious coding; dedicated tools (Claude Code, Cursor/Windsurf, Cline, Aider, Augment, Amazon Q, etc.) matter more than the base model alone.
  • Effective patterns:
    • Treat the LLM as a junior developer or pair programmer: write specs; ask for plans, phases, and tests first; then iterate.
    • Use agents to run commands, tests, and linters automatically in a loop, often inside containers or devcontainers.
    • Use git as the backbone: small branches, frequent commits, LLM-generated PRDs/PLAN docs, and multiple LLMs reviewing each other’s changes.
    • Libraries and mini-frameworks (nanoagent, toolkami, PocketFlow, custom MCP servers) implement the basic “while loop + tools” pattern for coding, text‑to‑SQL, REST/web search, device automation, etc.; a minimal version of that loop is sketched after this list.
  • Long-horizon reliability requires aggressive context control: pruning, custom MCP layers, guardrails, and “forgetting” past detail to avoid drift.
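
The “while loop + tools” pattern those libraries implement, together with the context pruning in the last bullet, fits in a few dozen lines. This is an illustrative sketch rather than any particular library's code: `chat()` stands in for whatever chat-completion call you use, and the tool set, message shapes, and pruning policy are all assumptions.

```python
import json
import subprocess

def run_shell(cmd: str) -> str:
    """Tool: run a shell command and return its (truncated) output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return (proc.stdout + proc.stderr)[-4000:]

def read_file(path: str) -> str:
    """Tool: return (the tail of) a file's contents."""
    with open(path, encoding="utf-8") as f:
        return f.read()[-8000:]

TOOLS = {"run_shell": run_shell, "read_file": read_file}

def prune(messages: list[dict], keep_last: int = 20) -> list[dict]:
    """Keep the system prompt plus the most recent turns; truncate old tool output.

    Deliberately crude: real agents summarize or re-plan rather than just dropping turns.
    """
    head, tail = messages[:1], messages[1:][-keep_last:]
    for m in tail:
        if m["role"] == "tool" and len(m["content"]) > 1000:
            m["content"] = m["content"][:1000] + "\n[...truncated...]"
    return head + tail

def agent_loop(chat, task: str, max_steps: int = 30) -> str:
    """Minimal agent loop: ask the model, run whatever tool it requests, repeat.

    `chat(messages) -> dict` is a placeholder for any chat-completion call that
    returns either {"content": ...} (done) or {"tool": name, "args": {...}}.
    """
    messages = [
        {"role": "system", "content": "You are a coding agent. Tools: " + ", ".join(TOOLS)},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = chat(prune(messages))
        if "tool" not in reply:          # no tool requested: the model is done
            return reply["content"]
        fn = TOOLS.get(reply["tool"])
        try:
            result = fn(**reply["args"]) if fn else f"unknown tool: {reply['tool']}"
        except Exception as exc:         # feed tool errors back instead of crashing
            result = f"tool error: {exc}"
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"
```

The whole point of the pattern is that the loop, not the model, owns control flow: tool results (including errors) go back into the transcript, and pruning keeps that transcript from drifting over long runs.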

Productivity, “vibe coding,” and reliability debate

  • Enthusiasts report 10x speedups on greenfield work and huge gains for tests, refactors, boilerplate, and multi-layer design iteration.
  • Others find the experience brittle beyond a few hundred LOC, with agents getting stuck, degrading over long conversations, or running off on tangents.
  • “Vibe coding” (accepting every suggested edit and feeding errors straight back to the LLM in a loop) is sharply contested:
    • Fans liken it to surfing and claim it works well for CRUD/throwaway apps.
    • Critics call it “monkeys with knives,” citing maintainability risks, production outages, and lost learning for juniors.
  • Broad agreement that LLM use is a learned skill; success depends on coaching the model, picking the right model/tooling combo, and keeping a human firmly “in the loop.”

Safety, economics, and ecosystem

  • Letting agents run bash and install tools is viewed as powerful but risky; some rely on containers and version control for isolation and easy rollback, while others worry that a single shell command is enough to deliver a malicious payload (see the sandbox sketch after this list).
  • Concerns about cost and API pricing (especially for reasoning models); some users sidestep this with flat-rate chat subscription plans or cheaper models.
  • Many note that 90% reliability is far from production‑grade; “the last 10%” (and beyond) grows exponentially harder, though reinforcement learning and monitoring agents show promise.
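
A common mitigation for the shell-access risk above is to execute every agent-issued command in a throwaway container instead of on the host. A rough sketch, assuming Docker is available; the image name, resource limits, and mount layout are placeholders, not a prescribed setup.

```python
import subprocess

def sandboxed_shell(cmd: str, workdir: str, image: str = "python:3.12-slim") -> str:
    """Run an agent-issued command in a disposable container.

    --network none blocks downloads and exfiltration, --rm discards the
    container afterwards, and the memory/CPU limits keep runaway commands
    from taking over the host. The workspace is the only host path mounted.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",
        "--memory", "1g", "--cpus", "1",
        "-v", f"{workdir}:/workspace",
        "-w", "/workspace",
        image,
        "sh", "-c", cmd,
    ]
    proc = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=300)
    return (proc.stdout + proc.stderr)[-4000:]

# Example: the agent asks to run the test suite.
# print(sandboxed_shell("python -m pytest -q", "/path/to/repo"))
```

Version control is the other half of the containment story: committing before each agent run keeps even a bad edit cheap to revert.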