The unreasonable effectiveness of an LLM agent loop with tool use

Model quality and tool-calling behavior

  • Many reports of inconsistent coding quality across models:

    • Claude Sonnet 3.7 seen as powerful but prone to weird detours, test-skipping, and “just catch the exception”‑style fixes.
    • GPT‑4o/4.1 often break code, truncate files, or refuse to apply edits directly; 4o especially criticized for coding.
    • o1/o3 “reasoning” models described as uniquely good at handling ~1,000 LOC full‑file edits, but expensive and/or rate‑limited.
    • Gemini 2.5 Pro praised for intelligence and tool-calling, but some find it reluctant or clumsy with tools in certain UIs.
    • Mistral Medium 3 and some local/Qwen models seen as surprisingly strong for cost, especially via OpenRouter/Ollama.
  • Tool use itself is uneven: some models hallucinate diff formats, misuse deprecated packages despite “knowing” they’re deprecated, or claim to have called tools when they haven’t. Others, when wired to compilers/tests/shell, self‑correct effectively in loops.
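
That self-correction pattern is simple enough to sketch. Below is a minimal, illustrative Python loop, assuming a repo with a pytest suite and a hypothetical `ask_model` callable that returns a unified diff; the prompts, helper names, and retry budget are assumptions of this sketch, not any specific commenter's setup.

```python
import subprocess

MAX_ATTEMPTS = 5

def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def apply_patch(repo_dir: str, diff: str) -> bool:
    """Apply a unified diff via git; False if it does not apply cleanly."""
    proc = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        cwd=repo_dir, input=diff, capture_output=True, text=True,
    )
    return proc.returncode == 0

def fix_until_green(ask_model, task: str, repo_dir: str) -> bool:
    """Feed failing test output back to the model until the suite passes.

    `ask_model(prompt) -> str` is a placeholder for any chat-completion call
    that returns a unified diff; it is an assumption of this sketch.
    """
    passed, output = run_tests(repo_dir)
    for _ in range(MAX_ATTEMPTS):
        if passed:
            return True
        prompt = (
            f"Task: {task}\n\nThe test suite is failing:\n{output[-4000:]}\n"
            "Reply with only a unified diff that fixes the failure."
        )
        diff = ask_model(prompt)
        if not apply_patch(repo_dir, diff):
            output = "Your previous diff did not apply cleanly; resend a valid diff."
            continue
        passed, output = run_tests(repo_dir)
    return passed
```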

Workflows, agents, and context management

  • Strong consensus that raw chat UI is the wrong interface for serious coding; dedicated tools (Claude Code, Cursor/Windsurf, Cline, Aider, Augment, Amazon Q, etc.) matter more than the base model alone.
  • Effective patterns:
    • Treat the LLM as a junior developer or pair programmer: write specs; ask for plans, phases, and tests first; then iterate.
    • Use agents to run commands, tests, and linters automatically in a loop, often inside containers or devcontainers.
    • Use git as the backbone: small branches, frequent commits, LLM-generated PRDs/PLAN docs, and multiple LLMs reviewing each other’s changes.
    • Libraries and mini-frameworks (nanoagent, toolkami, PocketFlow, custom MCP servers) implement the basic “while loop + tools” pattern for coding, text‑to‑SQL, REST/web search, device automation, etc.; a minimal version of that loop is sketched after this list.
  • Long-horizon reliability requires aggressive context control: pruning, custom MCP layers, guardrails, and “forgetting” past detail to avoid drift.
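
The “while loop + tools” pattern those libraries implement, together with the context pruning in the last bullet, fits in a few dozen lines. This is an illustrative sketch rather than any particular library's code: `chat()` stands in for whatever chat-completion call you use, and the tool set, message shapes, and pruning policy are all assumptions.

```python
import json
import subprocess

def run_shell(cmd: str) -> str:
    """Tool: run a shell command and return its (truncated) output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return (proc.stdout + proc.stderr)[-4000:]

def read_file(path: str) -> str:
    """Tool: return (the tail of) a file's contents."""
    with open(path, encoding="utf-8") as f:
        return f.read()[-8000:]

TOOLS = {"run_shell": run_shell, "read_file": read_file}

def prune(messages: list[dict], keep_last: int = 20) -> list[dict]:
    """Keep the system prompt plus the most recent turns; truncate old tool output.

    Deliberately crude: real agents summarize or re-plan rather than just dropping turns.
    """
    head, tail = messages[:1], messages[1:][-keep_last:]
    for m in tail:
        if m["role"] == "tool" and len(m["content"]) > 1000:
            m["content"] = m["content"][:1000] + "\n[...truncated...]"
    return head + tail

def agent_loop(chat, task: str, max_steps: int = 30) -> str:
    """Minimal agent loop: ask the model, run whatever tool it requests, repeat.

    `chat(messages) -> dict` is a placeholder for any chat-completion call that
    returns either {"content": ...} (done) or {"tool": name, "args": {...}}.
    """
    messages = [
        {"role": "system", "content": "You are a coding agent. Tools: " + ", ".join(TOOLS)},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = chat(prune(messages))
        if "tool" not in reply:          # no tool requested: the model is done
            return reply["content"]
        fn = TOOLS.get(reply["tool"])
        try:
            result = fn(**reply["args"]) if fn else f"unknown tool: {reply['tool']}"
        except Exception as exc:         # feed tool errors back instead of crashing
            result = f"tool error: {exc}"
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"
```

The whole point of the pattern is that the loop, not the model, owns control flow: tool results (including errors) go back into the transcript, and pruning keeps that transcript from drifting over long runs.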

Productivity, “vibe coding,” and reliability debate

  • Enthusiasts report 10x speedups on greenfield work and huge gains for tests, refactors, boilerplate, and multi-layer design iteration.
  • Others find the experience brittle beyond a few hundred LOC, with agents getting stuck, degrading over long conversations, or running off on tangents.
  • “Vibe coding” (accepting every suggested edit and feeding errors straight back to the LLM in a loop) is sharply contested:
    • Fans liken it to surfing and claim it works well for CRUD/throwaway apps.
    • Critics call it “monkeys with knives,” citing maintainability risks, production outages, and lost learning for juniors.
  • Broad agreement that LLM use is a learned skill; success depends on coaching the model, picking the right model/tooling combo, and keeping a human firmly “in the loop.”

Safety, economics, and ecosystem

  • Letting agents run bash and install tools is viewed as powerful but risky; some rely on containers and version control for isolation and easy rollback, while others worry that a single shell command is enough to deliver a malicious payload (see the sandbox sketch after this list).
  • Concerns about cost and API pricing (especially for reasoning models); some users sidestep this with flat-rate chat subscription plans or cheaper models.
  • Many note that 90% reliability is far from production‑grade; “the last 10%” (and beyond) grows exponentially harder, though reinforcement learning and monitoring agents show promise.
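
A common mitigation for the shell-access risk above is to execute every agent-issued command in a throwaway container instead of on the host. A rough sketch, assuming Docker is available; the image name, resource limits, and mount layout are placeholders, not a prescribed setup.

```python
import subprocess

def sandboxed_shell(cmd: str, workdir: str, image: str = "python:3.12-slim") -> str:
    """Run an agent-issued command in a disposable container.

    --network none blocks downloads and exfiltration, --rm discards the
    container afterwards, and the memory/CPU limits keep runaway commands
    from taking over the host. The workspace is the only host path mounted.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",
        "--memory", "1g", "--cpus", "1",
        "-v", f"{workdir}:/workspace",
        "-w", "/workspace",
        image,
        "sh", "-c", cmd,
    ]
    proc = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=300)
    return (proc.stdout + proc.stderr)[-4000:]

# Example: the agent asks to run the test suite.
# print(sandboxed_shell("python -m pytest -q", "/path/to/repo"))
```

Version control is the other half of the containment story: committing before each agent run keeps even a bad edit cheap to revert.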