Thoughts on a month with Devin

Devin and agentic workflows: strengths

  • When tasks are well-scoped, stacks are mainstream, and tests are easy to run, Devin can produce clean, test‑passing code and handle multi-file changes autonomously.
  • The “agent in a Slack/terminal” UX and closed-loop workflows (edit → run tests → iterate) impressed many and shifted expectations about what’s possible.
  • Some see current results as analogous to early image generation: rough now, but the mere fact it works at all suggests large future upside.

Major limitations and failure modes

  • Tends to make extraneous edits beyond the request, sometimes breaking unrelated functionality, and is bad at rolling those changes back.
  • Often gets stuck in “infinite thinking loops,” working for hours or days instead of asking for help, especially on “soft stops.”
  • Poor at admitting incapacity or integrating coaching; frequently compared to the worst stereotype of an overconfident junior dev.
  • Subtle, hard‑to‑spot mistakes (e.g., silently truncating a license header) undermine trust.
  • Users report no reliable way to predict which tasks it will succeed on, limiting its value as a tool.

Agents vs. narrower tools

  • Many argue Devin overreaches; narrower agents focused on bug fixes, small features, test/CI cleanup, or maintenance show much higher success rates and real enterprise interest.
  • Constrained agents and IDE‑integrated tools (Cursor, Copilot, Aider, OpenHands, others) are seen as more practical: they act as “power tools,” not replacements.
  • There’s discussion of orchestrators, time/“energy” limits, and supervisors (even non‑LLM models) to detect when an agent is stuck and halt or escalate.
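The supervisor idea above is straightforward to sketch: wrap the agent's step loop in a wall-clock and step budget, and escalate to a human instead of looping indefinitely. This is an illustrative sketch only; `agent_step` and `is_done` are hypothetical callables, and the budgets are arbitrary.

```python
import time

def run_with_budget(agent_step, is_done, budget_seconds=300, max_steps=50):
    """Hypothetical supervisor: run an agent one step at a time,
    halting and escalating when a time or step budget is exhausted,
    rather than letting it 'think' for hours."""
    start = time.monotonic()
    for step in range(max_steps):
        if time.monotonic() - start > budget_seconds:
            return ("escalate", f"time budget exceeded after {step} steps")
        result = agent_step()
        if is_done(result):
            return ("done", result)
    return ("escalate", f"step budget of {max_steps} exhausted")
```

A real supervisor could be more sophisticated (e.g., a separate model scoring whether progress is being made), but even a dumb budget converts "infinite thinking loops" into explicit escalations.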

Where LLM coding helps today

  • Explaining legacy or “arcane” code, proposing refactors, and writing tests with many edge cases.
  • Generating small, next‑step snippets in data science, SQL, matplotlib, shell/HTTP work, and onboarding to new technologies.
  • Automating tedious chores: merge conflicts, linter fixes, parameter reshuffling, multi‑file refactors.

Models, hallucinations, and trajectory

  • Some report newer reasoning models (e.g., o1 variants, Claude 3.5 Sonnet) hallucinate less for coding given good prompts and short context; others still feel “burned” and prefer to write code themselves.
  • Debate over whether progress is still rapid or already hitting diminishing returns.
  • Broad agreement that AI cannot yet replace engineers; its output needs the same scrutiny as a brand‑new hire’s.
  • Many expect continued pressure from companies to cut headcount using AI, with disagreement on how far that will actually go.