Thoughts on a month with Devin

Devin and agentic workflows: strengths

  • When tasks are well-scoped, stacks are mainstream, and tests are easy to run, Devin can produce clean, test‑passing code and handle multi-file changes autonomously.
  • The “agent in a Slack/terminal” UX and closed-loop workflows (edit → run tests → iterate) impressed many and shifted expectations about what’s possible.
  • Some see current results as analogous to early image generation: rough now, but the mere fact it works at all suggests large future upside.

Major limitations and failure modes

  • Tends to make extraneous edits beyond the request, sometimes breaking unrelated functionality, and is bad at rolling those changes back.
  • Often gets stuck in “infinite thinking loops,” working for hours or days instead of asking for help, especially on “soft stops.”
  • Poor at admitting incapacity or integrating coaching; frequently compared to the worst stereotype of an overconfident junior dev.
  • Subtle, hard‑to‑spot mistakes (e.g., silently truncating a license header) undermine trust.
  • Users report no reliable way to predict which tasks it will succeed on, limiting its value as a tool.

Agents vs. narrower tools

  • Many argue Devin overreaches; narrower agents focused on bug fixes, small features, test/CI cleanup, or maintenance show much higher success rates and real enterprise interest.
  • Constrained agents and IDE‑integrated tools (Cursor, Copilot, Aider, OpenHands, others) are seen as more practical: they act as “power tools,” not replacements.
  • There’s discussion of orchestrators, time/“energy” limits, and supervisors (even non‑LLM models) to detect when an agent is stuck and halt or escalate.
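The supervisor idea above is straightforward to sketch: wrap the agent's step loop in a wall-clock and step budget, and escalate to a human instead of looping indefinitely. This is an illustrative sketch only; `agent_step` and `is_done` are hypothetical callables, and the budgets are arbitrary.

```python
import time

def run_with_budget(agent_step, is_done, budget_seconds=300, max_steps=50):
    """Hypothetical supervisor: run an agent one step at a time,
    halting and escalating when a time or step budget is exhausted,
    rather than letting it 'think' for hours."""
    start = time.monotonic()
    for step in range(max_steps):
        if time.monotonic() - start > budget_seconds:
            return ("escalate", f"time budget exceeded after {step} steps")
        result = agent_step()
        if is_done(result):
            return ("done", result)
    return ("escalate", f"step budget of {max_steps} exhausted")
```

A real supervisor could be more sophisticated (e.g., a separate model scoring whether progress is being made), but even a dumb budget converts "infinite thinking loops" into explicit escalations.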

Where LLM coding helps today

  • Explaining legacy or “arcane” code, proposing refactors, and writing tests with many edge cases.
  • Generating small, next‑step snippets in data science, SQL, matplotlib, shell/HTTP work, and onboarding to new technologies.
  • Automating tedious chores: merge conflicts, linter fixes, parameter reshuffling, multi‑file refactors.

Models, hallucinations, and trajectory

  • Some report newer reasoning models (e.g., o1 variants, Claude 3.5 Sonnet) hallucinate less for coding given good prompts and short context; others still feel “burned” and prefer to write code themselves.
  • Debate over whether progress is still rapid or already hitting diminishing returns.
  • Broad agreement that AI cannot yet replace engineers; its output needs the same scrutiny as a brand‑new hire’s.
  • Many expect continued pressure from companies to cut headcount using AI, with disagreement on how far that will actually go.