The quality of AI-assisted software depends on unit-of-work management
Perceived Capability of Coding Agents Today
- Many see incremental, not revolutionary, gains versus a year ago: agents still reliably handle only “intern-level” tasks, succeed only ~50% of the time even on small changes, and frequently hallucinate or misread requirements.
- A minority report large productivity boosts (5–10x) in well-trodden domains such as framework-based web CRUD, but acknowledge that close supervision is required.
- Several compare current tools to “very expensive IntelliSense”: helpful autocomplete and boilerplate generation, but far from autonomous coding.
- Strong skepticism towards recurring claims that “this new model finally doesn’t suck,” attributed to a hedonic treadmill: users quickly adapt and then notice remaining limits.
Best Uses: Code Review, Exploration, and Tests
- Broad agreement that LLMs are much stronger at translating code into English than English into code.
- Popular uses: code review, explaining unfamiliar codebases, suggesting edge-case tests, exploring APIs/platforms, writing test scaffolding, and sketching prototypes where quality and maintainability don’t matter.
- Some developers find reviewing AI output mentally draining and corrosive to their sense of ownership; others find it easier than starting from scratch and appreciate the lack of social friction compared to human code review.
Unit of Work Size and Context Management
- Consensus that small, well-scoped tasks work best; large “agentic” changes often degrade into confusion, breakages, and escalating fixes.
- A common pattern: finish one task, summarize the changes into a small text artifact, then start a fresh context for the next task instead of running one long multi-step session (see the sketch after this list).
- Users report that tools’ automatic context compaction (e.g., once the context window fills up) often coincides with a quality collapse: redoing completed work, misreading the current state of the code, or “destroying” working code.
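A minimal sketch of the summarize-and-restart pattern described above, assuming a hypothetical `run_agent` wrapper around whatever agent CLI or API you use; the file name, prompts, and wrapper are illustrative, not any specific tool's interface:

```python
from pathlib import Path

SUMMARY_FILE = Path("handoff.md")  # the small text artifact carried between sessions

def run_agent(prompt: str) -> None:
    """Hypothetical wrapper around your agent's CLI/API; replace the body with a
    real call. Each invocation is assumed to start a completely fresh context."""
    print(f"--- new agent session ---\n{prompt}\n")

def do_task(task: str) -> None:
    # Seed the fresh session with only the small handoff artifact,
    # never the full transcript of earlier sessions.
    prior = SUMMARY_FILE.read_text() if SUMMARY_FILE.exists() else "None yet."
    run_agent(
        f"Summary of work completed so far:\n{prior}\n\n"
        f"Do exactly this task and nothing else:\n{task}\n\n"
        f"When finished, overwrite {SUMMARY_FILE} with a summary under 200 words: "
        "files touched, decisions made, anything the next task must know."
    )

if __name__ == "__main__":
    # One fresh, small context per task, handing off through the artifact.
    for task in [
        "Add input validation to the signup form",
        "Write tests covering the new validation rules",
    ]:
        do_task(task)
```

The point of this design is that context never accumulates: each session sees at most one task plus a bounded summary, so automatic compaction never needs to kick in.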
Planning, TDD, and Agents’ Inability to Follow Plans
- Many describe agents as “ambitious high-schooler/junior dev” level: can write functions, but are poor at reliably executing multi-step plans, running tests, or adhering to TDD without constant correction.
- Some have invested hundreds to thousands of hours developing bespoke workflows (TDD-heavy, strict supervision, Plan/Act modes) and report good results at scale, but this “AI management” skill is itself substantial overhead (see the sketch after this list).
- Others argue that if effective use requires that much micromanagement, it’s not a net productivity win for typical day-to-day coding.
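One way to picture such a TDD-gated loop, reusing the hypothetical `run_agent` wrapper from the earlier sketch; pytest as the test runner, the prompts, and the attempt budget are all illustrative assumptions, not a recipe anyone in the thread prescribed:

```python
import subprocess

def run_agent(prompt: str) -> None:
    """Hypothetical wrapper around your agent's CLI/API (see the earlier sketch)."""
    print(f"--- agent turn ---\n{prompt}\n")

def tests_pass() -> tuple[bool, str]:
    # The real test runner is the gate; the agent's own claim of success is ignored.
    proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def tdd_step(change: str, max_attempts: int = 3) -> bool:
    # Red: the agent writes the test first; a human should review it at this point.
    run_agent(f"Write a failing test for: {change}. Do not implement anything yet.")
    ok, _ = tests_pass()
    if ok:
        print("New test passed immediately; it likely asserts nothing. Stop and review.")
        return False

    # Green: a bounded fix loop, gated on the test runner rather than on trust.
    prompt = f"Make the new test pass with the smallest possible change for: {change}"
    for _ in range(max_attempts):
        run_agent(prompt)
        ok, output = tests_pass()
        if ok:
            return True
        prompt = f"Tests still fail. Fix only this failure:\n{output[-2000:]}"
    print("Attempt budget exhausted; hand the task back to a human.")
    return False

if __name__ == "__main__":
    tdd_step("reject signup emails without an @ sign")
```

The human checkpoints (reviewing the test, taking over when the budget runs out) are exactly the “constant correction” and supervision overhead the thread describes.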
User Stories, Architecture, and Units of Delivery
- Debate over vertical “user story” slices versus horizontal architectural layers as the right unit of work.
- One camp: vertical, customer-facing slices are crucial to validate business value early and reduce risk.
- The other camp: feature-led slicing produces fragile, hard-to-change systems; robust design should be layered and cross-cutting, with features emerging from composed infrastructure.
Overall Strategy
- A recurring recommendation: don’t spend energy on clever prompting to make agents do everything.
- Instead, focus on knowing when not to use them, and limit usage to tasks where you can quickly verify correctness or where the value is primarily understanding, not generation.