LLMs work best when the user defines their acceptance criteria first

Role of Acceptance Criteria and Testing

  • Many argue LLMs work best when acceptance criteria are explicit: performance targets, formats, invariants, and tests.
  • Threads emphasize writing tests or benchmarks first (TDD-like) so the model can iterate against a measurable definition of “correct.”
  • Some use automated “evals” and invariant tools to enforce constraints at every generation step, not just once.
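The test-first pattern above can be sketched in a few lines. The example below is a minimal illustration, not from any thread: a hypothetical `slugify` function is pinned down by acceptance tests and an invariant check before (or while) an LLM is asked to implement it, so every generation can be verified mechanically.

```python
import re
import string

def slugify(title: str) -> str:
    """Implementation under test (this is the part an LLM might write)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Acceptance tests written first: a measurable definition of "correct".
assert slugify("Hello, World!") == "hello-world"
assert slugify("  spaces  everywhere  ") == "spaces-everywhere"
assert slugify("Already-a-slug") == "already-a-slug"

# Invariants enforced on every generation, not just once: output is
# lowercase, URL-safe, with no leading/trailing/double hyphens.
for title in ["A/B Testing 101", "C++ vs. Rust!!", "___"]:
    s = slugify(title)
    assert s == s.lower()
    assert not s.startswith("-") and not s.endswith("-")
    assert "--" not in s
    assert all(c in string.ascii_lowercase + string.digits + "-" for c in s)
```

The point of the invariant loop is that it keeps holding as the model iterates, catching regressions that a single example-based test would miss.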

LLM Code Quality: Plausible vs Correct

  • Broad agreement that LLMs produce “plausible” code: syntactically clean and often able to pass simple tests, yet prone to hidden bugs or poor performance.
  • Case studies: a Rust SQLite clone that passes tests but is orders of magnitude slower; a fleur‑de‑lis drawing task where models flounder on novel procedural geometry; naive S3→Postgres imports that miss efficient mechanisms.
  • Several note that humans also write plausible-but-buggy code; 100% correctness was never the real bar.
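The plausible-vs-correct gap is easy to demonstrate with a toy case (invented here for illustration): two deduplication functions that pass identical functional tests but differ asymptotically, so only a benchmark, not the test suite, reveals the problem.

```python
import time

def dedup_naive(items):
    """Plausible: passes functional tests, but O(n^2) membership checks."""
    out = []
    for x in items:
        if x not in out:        # linear scan of the output list per element
            out.append(x)
    return out

def dedup_fast(items):
    """Same behavior, O(n): a set tracks what has already been seen."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Both pass the same simple acceptance test...
data = [3, 1, 3, 2, 1]
assert dedup_naive(data) == dedup_fast(data) == [3, 1, 2]

# ...but only measuring at scale exposes the difference.
big = list(range(5_000)) * 2
t0 = time.perf_counter(); dedup_naive(big); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); dedup_fast(big); t_fast = time.perf_counter() - t0
assert t_naive > t_fast    # the gap widens with n
```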

Effective Usage Patterns and Workflows

  • Best outcomes come from treating models like junior devs: specify constraints, architecture, and acceptance tests; make them plan first; decompose into small tasks.
  • Planning modes, “don’t code yet, ask clarifying questions,” and top‑down architecture design are repeatedly recommended.
  • Some prefer LLMs as autocomplete for small snippets; large agentic rewrites are seen as brittle and hard to review.

Failure Modes and Limitations

  • Common issues: code bloat, endless patching instead of refactoring, partial migrations, hallucinated APIs, weak tests/mocking, and compounding tech debt.
  • Performance: models default to naive algorithms unless explicitly prompted to research and compare faster approaches.
  • Visual and spatial tasks (SVG shapes, images) remain weak; proprietary or niche frameworks fare worse than mainstream stacks.
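The “weak tests/mocking” failure mode can be sketched concretely. The names here (`total_price`, `fetch_price`) are hypothetical: a mock that always returns the same value exercises only the arithmetic, while a fake with varied data plus an invariant constrains far more of the behavior.

```python
from unittest import mock

def total_price(cart, tax_rate, fetch_price):
    """Sums item prices from a price lookup and applies tax."""
    subtotal = sum(fetch_price(item) for item in cart)
    return round(subtotal * (1 + tax_rate), 2)

# Weak test: the mock returns one fixed price, so a bug in how prices
# are fetched or summed for varied items would never be caught.
fetch = mock.Mock(return_value=10.0)
assert total_price(["a", "b"], 0.1, fetch) == 22.0

# Stronger test: a fake with realistic, varied data plus an invariant.
prices = {"a": 3.0, "b": 7.5, "c": 0.0}
assert total_price(["a", "b", "c"], 0.0, prices.get) == 10.5
assert total_price([], 0.2, prices.get) == 0.0   # empty-cart invariant
```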

Productivity, Skills, and Workforce Impact

  • Enthusiasts report 4–10x productivity, claiming LLM code can match or exceed typical enterprise quality when guided well.
  • Skeptics counter that review, debugging, and architecture still dominate effort, so net gains are modest for hard problems.
  • Debate over juniors: some see LLMs as accelerants; others worry trainees won’t develop deep understanding if they only “steer” agents.

Agents, Tools, and Future Directions

  • Distinction made between raw LLMs, chatbots, and full agents with tools, memory, and code execution.
  • Coding agents that can run benchmarks, tests, and refactors are seen as key to closing the gap between plausible and truly correct code.
  • Some expect big gains from reinforcement-style finetuning in verifiable domains like code and math; others stress that hard‑won, battle‑tested designs still require time and human judgment.
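A generate-verify-retry loop of the kind these agents run can be sketched without any model at all. In this stand-in, each hard-coded “candidate” plays the role of one model generation, and the test suite’s exit code is the feedback signal; the structure, not the candidates, is the point.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str) -> bool:
    """Execute a candidate implementation plus its tests in a subprocess;
    pass/fail becomes the feedback signal for the next iteration."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def agent_loop(candidates, tests):
    """Try each generation in turn until one passes the tests."""
    for attempt, code in enumerate(candidates, 1):
        if run_candidate(code, tests):
            return attempt, code
    return None, None

tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
buggy = "def add(a, b):\n    return a - b"   # first generation: wrong
fixed = "def add(a, b):\n    return a + b"   # second generation: correct
attempt, code = agent_loop([buggy, fixed], tests)
assert attempt == 2
```

Closing the plausible-vs-correct gap then reduces to making `tests` strong enough (benchmarks, invariants) that passing it actually means correct.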