The quality of AI-assisted software depends on unit-of-work management

Perceived Capability of Coding Agents Today

  • Many see incremental rather than revolutionary gains compared with a year ago: agents still only handle “intern-level” tasks reliably, often succeeding on only ~50% of even small changes, with frequent hallucinations and misread requirements.
  • A minority report large productivity boosts (5–10x) in well-trodden domains like framework-based web CRUD, but acknowledge that close supervision is required.
  • Several compare current tools to “very expensive IntelliSense”: helpful autocomplete and boilerplate generation, but far from autonomous coding.
  • Strong skepticism towards recurring claims that “this new model finally doesn’t suck,” attributed to a hedonic treadmill: users quickly adapt and then notice remaining limits.

Best Uses: Code Review, Exploration, and Tests

  • Broad agreement that LLMs are much stronger at going from code → English than the reverse.
  • Popular uses: code review, explaining unfamiliar codebases, suggesting edge-case tests, exploring APIs/platforms, writing test scaffolding (a sketch follows this list), and prototyping where quality and maintainability don’t matter.
  • Some developers find reviewing AI output mentally draining and ownership-reducing; others find it easier than starting from scratch and appreciate the absence of the social friction that comes with human code review.
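
A minimal illustration of the test-scaffolding use case: small, easily verified output of the kind the thread considers a good fit. The names here are assumptions for the example; slugify and myapp.text are hypothetical, not from the discussion.

    # Hypothetical pytest scaffold with edge cases: low-risk output that a
    # reviewer can verify at a glance. `myapp.text.slugify` is assumed.
    import pytest

    from myapp.text import slugify  # hypothetical module under test

    @pytest.mark.parametrize(
        ("raw", "expected"),
        [
            ("Hello World", "hello-world"),    # happy path
            ("", ""),                          # empty input
            ("  spaces  ", "spaces"),          # surrounding whitespace
            ("a--b", "a-b"),                   # repeated separators
        ],
    )
    def test_slugify_edge_cases(raw: str, expected: str) -> None:
        assert slugify(raw) == expected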

Unit of Work Size and Context Management

  • Consensus that small, well-scoped tasks work best; large “agentic” changes often degrade into confusion, breakages, and escalating fixes.
  • A common pattern: finish one task, summarize the changes into a small text artifact, and start a fresh context for the next task, instead of running long multi-step sessions (a sketch follows this list).
  • Users report that tools’ automatic compaction (e.g., once a long session fills the context window) often correlates with a collapse in quality: redoing completed work, misreading the current state of the code, or “destroying” working code.
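
A minimal sketch of that pattern, with loudly assumed pieces: run_agent is a hypothetical wrapper around whichever coding tool is in use (each call starts a brand-new context), and handoff.md is the small text artifact carried between tasks. No specific product’s API is implied.

    # "Summarize, then restart": carry a short handoff file between tasks
    # instead of one long session, so auto-compaction never kicks in.
    import subprocess
    from pathlib import Path

    HANDOFF = Path("handoff.md")  # small text artifact passed between tasks

    def run_agent(prompt: str) -> str:
        """Placeholder: invoke your coding tool with a fresh context."""
        raise NotImplementedError

    def do_task(task: str) -> None:
        prior = HANDOFF.read_text() if HANDOFF.exists() else ""
        # Only the short summary is carried over, not the full transcript.
        run_agent(f"Prior state:\n{prior}\n\nTask:\n{task}")
        # Derive the next handoff from the actual diff, not chat history.
        diff = subprocess.run(
            ["git", "diff", "--stat"], capture_output=True, text=True
        ).stdout
        HANDOFF.write_text(run_agent(
            f"Summarize these changes in under 20 lines:\n{diff}"
        ))

    for task in ["add the endpoint", "wire up validation", "write the tests"]:
        do_task(task)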

Planning, TDD, and Agents’ Inability to Follow Plans

  • Many describe agents as operating at the level of an “ambitious high-schooler or junior dev”: they can write functions, but are poor at reliably executing multi-step plans, running tests, or adhering to TDD without constant correction.
  • Some have invested hundreds to thousands of hours developing bespoke workflows (TDD-heavy, strict supervision, Plan/Act modes; sketched after this list) and report good results at scale, but this “AI management” skill is itself substantial overhead.
  • Others argue that if effective use requires that much micromanagement, it’s not a net productivity win for typical day-to-day coding.
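
One way such a workflow can look, as a hedged sketch rather than any particular tool’s method: a read-only planning step, a test written first, and the agent’s edits kept only if the suite passes. run_agent is again a hypothetical wrapper; the retry budget and pytest invocation are assumptions.

    # TDD-gated supervision loop: accept the agent's work only when the
    # tests pass; otherwise feed the concrete failure back and retry.
    import subprocess

    MAX_ATTEMPTS = 3

    def run_agent(prompt: str) -> str:
        """Placeholder: invoke your coding tool; returns its reply."""
        raise NotImplementedError

    def run_tests() -> tuple[bool, str]:
        proc = subprocess.run(["pytest", "-x", "-q"],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def implement(requirement: str) -> bool:
        # Plan mode: read-only, so the agent can't wander off editing files.
        plan = run_agent(f"Plan only, make no edits: {requirement}")
        run_agent(f"Write one failing test for: {requirement}\nPlan:\n{plan}")
        run_agent(f"Now make that test pass, following the plan:\n{plan}")
        for _ in range(MAX_ATTEMPTS):
            ok, output = run_tests()
            if ok:
                return True  # a human still reviews before merge
            # Constant correction: hand back the real failure output
            # rather than trusting the agent's own account of the state.
            run_agent(f"Tests still fail:\n{output}\nFix only what is needed.")
        subprocess.run(["git", "checkout", "--", "."])  # discard the attempt
        return False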

User Stories, Architecture, and Units of Delivery

  • Debate over vertical “user story” slices vs horizontal architectural layers as the right unit of work.
  • One camp: vertical, customer-facing slices are crucial to validate business value early and reduce risk.
  • The other camp: feature-led slicing produces fragile, hard-to-change systems; robust design should be layered and cross-cutting, with features emerging from composed infrastructure. (Both shapes are sketched below.)
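
To make the two shapes concrete, a hedged illustration with invented names (cancel_order, Order, and OrderRepo are all hypothetical, not from the thread):

    # Vertical "user story" slice: one unit delivers a thin end-to-end
    # path, so its business value can be validated as soon as it ships.
    #   features/cancel_order/   api.py, service.py, storage.py
    #
    # Horizontal layers: units of work are cross-cutting layers; features
    # emerge later by composing them.
    #   api/   domain/   storage/   (each shared by every feature)
    from dataclasses import dataclass

    @dataclass
    class Order:
        id: str
        status: str = "open"
        shipped: bool = False

    class OrderRepo:
        """In-memory stand-in for the slice's own persistence."""
        def __init__(self) -> None:
            self._orders: dict[str, Order] = {}

        def get(self, order_id: str) -> Order:
            return self._orders[order_id]

        def save(self, order: Order) -> None:
            self._orders[order.id] = order

    def cancel_order(order_id: str, repo: OrderRepo) -> None:
        """The whole story in one place: rule, state change, persistence."""
        order = repo.get(order_id)
        if order.shipped:
            raise ValueError("cannot cancel a shipped order")
        order.status = "cancelled"
        repo.save(order)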

Overall Strategy

  • A recurring recommendation: don’t spend energy on clever prompting to make agents do everything.
  • Instead, focus on knowing when not to use them, and limit usage to tasks where you can quickly verify correctness or where the value is primarily understanding, not generation.