Thoughts on a month with Devin
Devin and agentic workflows: strengths
- When tasks are well-scoped, stacks are mainstream, and tests are easy to run, Devin can produce clean, test‑passing code and handle multi-file changes autonomously.
- The “agent in a Slack/terminal” UX and closed-loop workflows (edit → run tests → iterate) impressed many and shifted expectations about what’s possible.
- Some see current results as analogous to early image generation: rough now, but the mere fact it works at all suggests large future upside.
Major limitations and failure modes
- Tends to make extraneous edits beyond the request, sometimes breaking unrelated functionality, and is bad at rolling those changes back.
- Often gets stuck in “infinite thinking loops,” working for hours or days instead of asking for help, especially on “soft stops.”
- Poor at admitting when it cannot do something and at taking coaching; frequently compared to the worst stereotype of an overconfident junior dev.
- Subtle, hard‑to‑spot mistakes (e.g., silently truncating a license header) undermine trust.
- Users report no reliable way to predict which tasks it will succeed on, limiting its value as a tool.
Agents vs. narrower tools
- Many argue Devin overreaches; narrower agents focused on bug fixes, small features, test/CI cleanup, or maintenance show much higher success rates and real enterprise interest.
- Constrained agents and IDE‑integrated tools (Cursor, Copilot, Aider, OpenHands, others) are seen as more practical: they act as “power tools,” not replacements.
- There’s discussion of orchestrators, time/“energy” limits, and supervisors (even non‑LLM models) to detect when an agent is stuck and halt or escalate.
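The supervisor idea above can be sketched minimally: run the agent step by step under a wall-clock budget and halt or escalate when it stops making progress. This is a hypothetical illustration, not any product's actual API; `agent_step`, the budget, and the stall heuristic are all assumptions.

```python
import time

def run_with_supervisor(agent_step, budget_s=600, max_stalls=3):
    """Drive an agent callback under a wall-clock budget.

    agent_step() -> (output, done). Returns ("success", output) when the
    agent finishes, or ("escalate", reason) when the time budget runs out
    or the agent repeats itself -- a crude stand-in for a 'stuck' detector.
    """
    start = time.monotonic()
    recent, stalls = None, 0
    while time.monotonic() - start < budget_s:
        output, done = agent_step()
        if done:
            return ("success", output)
        if output == recent:          # no new progress this step
            stalls += 1
            if stalls >= max_stalls:
                return ("escalate", "agent appears stuck; ask a human")
        else:
            recent, stalls = output, 0
    return ("escalate", "time budget exhausted")
```

A real supervisor might use a cheap classifier instead of exact-repeat detection, but the shape — budget, progress check, escalation path — is the same.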
Where LLM coding helps today
- Explaining legacy or “arcane” code, proposing refactors, and writing tests with many edge cases.
- Generating small, next‑step snippets in data science, SQL, matplotlib, shell/HTTP work, and onboarding to new technologies.
- Automating tedious chores: merge conflicts, linter fixes, parameter reshuffling, multi‑file refactors.
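The edge-case test writing mentioned above is a good fit because enumerating inputs is tedious for humans and cheap for a model. A hypothetical example — the `slugify` function and its cases are invented for illustration, not taken from any discussed project:

```python
def slugify(text: str) -> str:
    """Lowercase, keep alphanumerics, collapse everything else into single hyphens."""
    out, prev_dash = [], True  # True suppresses a leading hyphen
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
            prev_dash = False
        elif not prev_dash:
            out.append("-")
            prev_dash = True
    return "".join(out).rstrip("-")

# The kind of edge-case battery an LLM is good at enumerating:
cases = {
    "Hello, World!": "hello-world",
    "  leading and trailing  ": "leading-and-trailing",
    "---": "",            # punctuation-only input
    "": "",               # empty input
    "a--b__c": "a-b-c",   # runs of separators collapse
}
for raw, expected in cases.items():
    assert slugify(raw) == expected, (raw, slugify(raw))
```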
Models, hallucinations, and trajectory
- Some report newer reasoning models (e.g., o1 variants, Claude 3.5 Sonnet) hallucinate less on coding tasks given good prompts and short context; others still feel “burned” and prefer to write code themselves.
- Debate over whether progress is still rapid or already hitting diminishing returns.
- Broad agreement that AI cannot yet replace engineers; its output needs review comparable to that of a brand‑new hire.
- Many expect continued pressure from companies to cut headcount using AI, with disagreement on how far that will actually go.