The unreasonable effectiveness of an LLM agent loop with tool use
Model quality and tool-calling behavior
Many reports of inconsistent coding quality across models:
- Claude 3.7 Sonnet seen as powerful but prone to weird detours, test-skipping, and “just catch the exception”-style fixes.
- GPT‑4o/4.1 often break code, truncate files, or refuse to apply edits directly; 4o especially criticized for coding.
- o1/o3 “reasoning” models described as uniquely good at handling ~1,000 LOC full‑file edits, but expensive and/or rate‑limited.
- Gemini 2.5 Pro praised for intelligence and tool-calling, but some find it reluctant or clumsy with tools in certain UIs.
- Mistral Medium 3 and some local/Qwen models seen as surprisingly strong for cost, especially via OpenRouter/Ollama.
Tool use itself is uneven: some models hallucinate diff formats, misuse deprecated packages despite “knowing” they’re deprecated, or claim to have called tools when they haven’t. Others, when wired to compilers/tests/shell, self‑correct effectively in loops.
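That self-correcting loop is simple enough to sketch. The following is a minimal illustration in Python, assuming the OpenAI Python SDK and a pytest suite; the model name, target file, and prompt wording are placeholders, not anything cited in the discussion:

```python
# Sketch: rerun the test suite and feed failures back to the model until it passes.
# Assumes the OpenAI Python SDK; model name, file path, and prompts are placeholders.
import subprocess
from openai import OpenAI

client = OpenAI()
TARGET = "app.py"  # hypothetical file the model is allowed to rewrite

def run_tests() -> subprocess.CompletedProcess:
    # Capture pytest output so it can be shown to the model verbatim.
    return subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)

def propose_fix(source: str, failures: str) -> str:
    # Ask for a full replacement file given the failing output.
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model name
        messages=[
            {"role": "system", "content": "Return only the full corrected file, no prose, no fences."},
            {"role": "user", "content": f"Tests fail with:\n{failures}\n\nCurrent {TARGET}:\n{source}"},
        ],
    )
    return resp.choices[0].message.content

for attempt in range(5):  # bounded so a confused model cannot loop forever
    result = run_tests()
    if result.returncode == 0:
        print(f"tests green after {attempt} fix attempt(s)")
        break
    source = open(TARGET).read()
    fixed = propose_fix(source, result.stdout + result.stderr)
    open(TARGET, "w").write(fixed)  # naive: trusts a whole-file rewrite from the model
else:
    print("giving up; suite still red")
```

The loop is deliberately dumb: the only feedback channel is real test output, which is exactly what lets weaker tool-callers recover from their own mistakes.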
Workflows, agents, and context management
- Strong consensus that raw chat UI is the wrong interface for serious coding; dedicated tools (Claude Code, Cursor/Windsurf, Cline, Aider, Augment, Amazon Q, etc.) matter more than the base model alone.
- Effective patterns:
- Treat the LLM as a junior developer or pair programmer: write specs; ask for plans, phases, and tests first; then iterate.
- Use agents to run commands, tests, and linters automatically in a loop, often inside containers or devcontainers.
- Use git as the backbone: small branches, frequent commits, LLM-generated PRDs/PLAN docs, and multiple LLMs reviewing each other’s changes.
- Libraries and mini-frameworks (nanoagent, toolkami, PocketFlow, custom MCP servers) implement the basic “while loop + tools” pattern for coding, text‑to‑SQL, REST/web search, device automation, etc. (a minimal sketch of the loop follows this list).
- Long-horizon reliability requires aggressive context control: pruning, custom MCP layers, guardrails, and “forgetting” past detail to avoid drift.
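The core of those mini-frameworks is small enough to show inline. Below is a sketch of the “while loop + tools” agent, assuming the OpenAI SDK's function-calling interface; the single run_shell tool, the model name, and the trimming policy are illustrative and not taken from any of the libraries named above:

```python
# Sketch of the basic "while loop + tools" agent, assuming the OpenAI SDK's
# function-calling interface. The run_shell tool, the model name, and the
# trimming policy are illustrative.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the repository and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

def run_shell(cmd: str) -> str:
    # The point of the loop: the model sees real compiler/test/linter output.
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return (p.stdout + p.stderr)[-4000:]  # crude cap on tool output size

def trim(messages: list, keep: int = 20) -> list:
    # Naive context control: system prompt plus the most recent turns. A real
    # pruner must keep each assistant tool_calls message paired with its tool replies.
    return messages[:1] + messages[1:][-keep:]

messages = [
    {"role": "system", "content": "Fix the failing build. Use run_shell to inspect, edit, and test."},
    {"role": "user", "content": "Make the test suite pass."},
]

for _ in range(30):  # cap the number of turns to limit drift and cost
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model name
        messages=trim(messages),
        tools=TOOLS,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no tool call: the model believes it is done
        print(msg.content)
        break
    for call in msg.tool_calls:  # execute each requested tool, return the result
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(**args),
        })
```

Real agents add more tools (file edits, search), richer pruning, and guardrails on what run_shell may execute; the loop structure itself rarely gets more complicated than this.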
Productivity, “vibe coding,” and reliability debate
- Enthusiasts report 10x speedups on greenfield work and huge gains for tests, refactors, boilerplate, and multi-layer design iteration.
- Others find the experience brittle beyond a few hundred LOC, with agents getting stuck, degrading over long conversations, or running off on tangents.
- “Vibe coding” (accepting every suggestion and pasting errors back into the model until the code runs) is sharply contested:
- Fans liken it to surfing and claim it works well for CRUD/throwaway apps.
- Critics call it “monkeys with knives,” citing maintainability problems, production outages, and lost learning opportunities for juniors.
- Broad agreement that LLM use is a learned skill; success depends on coaching the model, picking the right model/tooling combo, and keeping a human firmly “in the loop.”
Safety, economics, and ecosystem
- Letting agents run bash and install tools is viewed as powerful but risky; some rely on containers and version control, while others worry that shell access makes delivering a malicious payload trivial (a sandboxing sketch follows this list).
- Concerns about cost and API pricing (especially for reasoning models); some users sidestep this with flat-rate chat/IDE subscription plans or cheaper models.
- Many note that 90% reliability is far from production‑grade; “the last 10%” (and beyond) grows exponentially harder, though reinforcement learning and monitoring agents show promise.
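As one concrete form of the container mitigation mentioned above, here is a sketch of a shell tool that executes agent commands in a throwaway Docker container; it assumes a local Docker daemon, and the image, mount path, and resource limits are illustrative:

```python
# Sketch: execute agent-requested commands in a throwaway Docker container rather
# than the host shell. Assumes a local Docker daemon; the image, mount path, and
# resource limits are illustrative.
import subprocess

def sandboxed_shell(cmd: str, repo: str = "/path/to/repo") -> str:
    proc = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # no outbound network for the agent
            "--memory", "1g",
            "--cpus", "1",
            "-v", f"{repo}:/work",    # only the working copy is visible
            "-w", "/work",
            "python:3.12-slim",       # illustrative base image
            "bash", "-lc", cmd,
        ],
        capture_output=True, text=True, timeout=300,
    )
    return proc.stdout + proc.stderr
```

Paired with version control, anything the agent breaks is confined to the mounted working copy and can be reverted with git.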