The quality of AI-assisted software depends on unit-of-work management

Perceived Capability of Coding Agents Today

  • Many see incremental rather than revolutionary gains compared with a year ago: agents still only handle “intern-level” tasks reliably, often succeeding on only ~50% of even small changes, with frequent hallucinations and misread requirements.
  • A minority report large productivity boosts (5–10x) in well-trodden domains like framework-based web CRUD, but acknowledge that close supervision is required.
  • Several compare current tools to “very expensive IntelliSense”: helpful autocomplete and boilerplate generation, but far from autonomous coding.
  • Strong skepticism towards recurring claims that “this new model finally doesn’t suck,” attributed to a hedonic treadmill: users quickly adapt and then notice remaining limits.

Best Uses: Code Review, Exploration, and Tests

  • Broad agreement that LLMs are much stronger at going from code → English than the reverse.
  • Popular uses: code review, explaining unfamiliar codebases, suggesting edge-case tests, exploring APIs/platforms, writing test scaffolding (a sketch follows this list), and prototyping where quality and maintainability don’t matter.
  • Some developers find reviewing AI output mentally draining and ownership-reducing; others find it easier than starting from scratch and appreciate the absence of the social friction that comes with human code review.
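
A minimal illustration of the test-scaffolding use case: small, easily verified output of the kind the thread considers a good fit. The names here are assumptions for the example; slugify and myapp.text are hypothetical, not from the discussion.

    # Hypothetical pytest scaffold with edge cases: low-risk output that a
    # reviewer can verify at a glance. `myapp.text.slugify` is assumed.
    import pytest

    from myapp.text import slugify  # hypothetical module under test

    @pytest.mark.parametrize(
        ("raw", "expected"),
        [
            ("Hello World", "hello-world"),    # happy path
            ("", ""),                          # empty input
            ("  spaces  ", "spaces"),          # surrounding whitespace
            ("a--b", "a-b"),                   # repeated separators
        ],
    )
    def test_slugify_edge_cases(raw: str, expected: str) -> None:
        assert slugify(raw) == expected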

Unit of Work Size and Context Management

  • Consensus that small, well-scoped tasks work best; large “agentic” changes often degrade into confusion, breakages, and escalating fixes.
  • A common pattern: finish one task, summarize the changes into a small text artifact, and start a fresh context for the next task, instead of running long multi-step sessions (a sketch follows this list).
  • Users report that tools’ automatic compaction (e.g., once a long session fills the context window) often correlates with a collapse in quality: redoing completed work, misreading the current state of the code, or “destroying” working code.
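
A minimal sketch of that pattern, with loudly assumed pieces: run_agent is a hypothetical wrapper around whichever coding tool is in use (each call starts a brand-new context), and handoff.md is the small text artifact carried between tasks. No specific product’s API is implied.

    # "Summarize, then restart": carry a short handoff file between tasks
    # instead of one long session, so auto-compaction never kicks in.
    import subprocess
    from pathlib import Path

    HANDOFF = Path("handoff.md")  # small text artifact passed between tasks

    def run_agent(prompt: str) -> str:
        """Placeholder: invoke your coding tool with a fresh context."""
        raise NotImplementedError

    def do_task(task: str) -> None:
        prior = HANDOFF.read_text() if HANDOFF.exists() else ""
        # Only the short summary is carried over, not the full transcript.
        run_agent(f"Prior state:\n{prior}\n\nTask:\n{task}")
        # Derive the next handoff from the actual diff, not chat history.
        diff = subprocess.run(
            ["git", "diff", "--stat"], capture_output=True, text=True
        ).stdout
        HANDOFF.write_text(run_agent(
            f"Summarize these changes in under 20 lines:\n{diff}"
        ))

    for task in ["add the endpoint", "wire up validation", "write the tests"]:
        do_task(task)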

Planning, TDD, and Agents’ Inability to Follow Plans

  • Many describe agents as operating at the level of an “ambitious high-schooler or junior dev”: they can write functions, but are poor at reliably executing multi-step plans, running tests, or adhering to TDD without constant correction.
  • Some have invested hundreds to thousands of hours developing bespoke workflows (TDD-heavy, strict supervision, Plan/Act modes; sketched after this list) and report good results at scale, but this “AI management” skill is itself substantial overhead.
  • Others argue that if effective use requires that much micromanagement, it’s not a net productivity win for typical day-to-day coding.
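
One way such a workflow can look, as a hedged sketch rather than any particular tool’s method: a read-only planning step, a test written first, and the agent’s edits kept only if the suite passes. run_agent is again a hypothetical wrapper; the retry budget and pytest invocation are assumptions.

    # TDD-gated supervision loop: accept the agent's work only when the
    # tests pass; otherwise feed the concrete failure back and retry.
    import subprocess

    MAX_ATTEMPTS = 3

    def run_agent(prompt: str) -> str:
        """Placeholder: invoke your coding tool; returns its reply."""
        raise NotImplementedError

    def run_tests() -> tuple[bool, str]:
        proc = subprocess.run(["pytest", "-x", "-q"],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def implement(requirement: str) -> bool:
        # Plan mode: read-only, so the agent can't wander off editing files.
        plan = run_agent(f"Plan only, make no edits: {requirement}")
        run_agent(f"Write one failing test for: {requirement}\nPlan:\n{plan}")
        run_agent(f"Now make that test pass, following the plan:\n{plan}")
        for _ in range(MAX_ATTEMPTS):
            ok, output = run_tests()
            if ok:
                return True  # a human still reviews before merge
            # Constant correction: hand back the real failure output
            # rather than trusting the agent's own account of the state.
            run_agent(f"Tests still fail:\n{output}\nFix only what is needed.")
        subprocess.run(["git", "checkout", "--", "."])  # discard the attempt
        return False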

User Stories, Architecture, and Units of Delivery

  • Debate over vertical “user story” slices vs horizontal architectural layers as the right unit of work.
  • One camp: vertical, customer-facing slices are crucial to validate business value early and reduce risk.
  • The other camp: feature-led slicing produces fragile, hard-to-change systems; robust design should be layered and cross-cutting, with features emerging from composed infrastructure. (Both shapes are sketched below.)
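
To make the two shapes concrete, a hedged illustration with invented names (cancel_order, Order, and OrderRepo are all hypothetical, not from the thread):

    # Vertical "user story" slice: one unit delivers a thin end-to-end
    # path, so its business value can be validated as soon as it ships.
    #   features/cancel_order/   api.py, service.py, storage.py
    #
    # Horizontal layers: units of work are cross-cutting layers; features
    # emerge later by composing them.
    #   api/   domain/   storage/   (each shared by every feature)
    from dataclasses import dataclass

    @dataclass
    class Order:
        id: str
        status: str = "open"
        shipped: bool = False

    class OrderRepo:
        """In-memory stand-in for the slice's own persistence."""
        def __init__(self) -> None:
            self._orders: dict[str, Order] = {}

        def get(self, order_id: str) -> Order:
            return self._orders[order_id]

        def save(self, order: Order) -> None:
            self._orders[order.id] = order

    def cancel_order(order_id: str, repo: OrderRepo) -> None:
        """The whole story in one place: rule, state change, persistence."""
        order = repo.get(order_id)
        if order.shipped:
            raise ValueError("cannot cancel a shipped order")
        order.status = "cancelled"
        repo.save(order)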

Overall Strategy

  • A recurring recommendation: don’t spend energy on clever prompting to make agents do everything.
  • Instead, focus on knowing when not to use them, and limit usage to tasks where you can quickly verify correctness or where the value is primarily understanding, not generation.