LLMs work best when the user defines their acceptance criteria first

Role of Acceptance Criteria and Testing

  • Many argue LLMs work best when acceptance criteria are explicit: performance targets, formats, invariants, and tests.
  • Threads emphasize writing tests or benchmarks first (TDD-like) so the model can iterate against a measurable definition of “correct.”
  • Some use automated “evals” and invariant tools to enforce constraints at every generation step, not just once.
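The test-first pattern above can be sketched in a few lines. The example below is a minimal illustration, not from any thread: a hypothetical `slugify` function is pinned down by acceptance tests and an invariant check before (or while) an LLM is asked to implement it, so every generation can be verified mechanically.

```python
import re
import string

def slugify(title: str) -> str:
    """Implementation under test (this is the part an LLM might write)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Acceptance tests written first: a measurable definition of "correct".
assert slugify("Hello, World!") == "hello-world"
assert slugify("  spaces  everywhere  ") == "spaces-everywhere"
assert slugify("Already-a-slug") == "already-a-slug"

# Invariants enforced on every generation, not just once: output is
# lowercase, URL-safe, with no leading/trailing/double hyphens.
for title in ["A/B Testing 101", "C++ vs. Rust!!", "___"]:
    s = slugify(title)
    assert s == s.lower()
    assert not s.startswith("-") and not s.endswith("-")
    assert "--" not in s
    assert all(c in string.ascii_lowercase + string.digits + "-" for c in s)
```

The point of the invariant loop is that it keeps holding as the model iterates, catching regressions that a single example-based test would miss.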

LLM Code Quality: Plausible vs Correct

  • Broad agreement that LLMs produce “plausible” code: syntactically clean and often able to pass simple tests, yet prone to hidden bugs or poor performance.
  • Case studies: a Rust SQLite clone that passes tests but is orders of magnitude slower; a fleur‑de‑lis drawing task where models flounder on novel procedural geometry; naive S3→Postgres imports that miss efficient mechanisms.
  • Several note that humans also write plausible-but-buggy code; 100% correctness was never the real bar.
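The plausible-vs-correct gap is easy to demonstrate with a toy case (invented here for illustration): two deduplication functions that pass identical functional tests but differ asymptotically, so only a benchmark, not the test suite, reveals the problem.

```python
import time

def dedup_naive(items):
    """Plausible: passes functional tests, but O(n^2) membership checks."""
    out = []
    for x in items:
        if x not in out:        # linear scan of the output list per element
            out.append(x)
    return out

def dedup_fast(items):
    """Same behavior, O(n): a set tracks what has already been seen."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Both pass the same simple acceptance test...
data = [3, 1, 3, 2, 1]
assert dedup_naive(data) == dedup_fast(data) == [3, 1, 2]

# ...but only measuring at scale exposes the difference.
big = list(range(5_000)) * 2
t0 = time.perf_counter(); dedup_naive(big); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); dedup_fast(big); t_fast = time.perf_counter() - t0
assert t_naive > t_fast    # the gap widens with n
```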

Effective Usage Patterns and Workflows

  • Best outcomes come from treating models like junior devs: specify constraints, architecture, and acceptance tests; make them plan first; decompose into small tasks.
  • Planning modes, “don’t code yet, ask clarifying questions,” and top‑down architecture design are repeatedly recommended.
  • Some prefer LLMs as autocomplete for small snippets; large agentic rewrites are seen as brittle and hard to review.

Failure Modes and Limitations

  • Common issues: code bloat, endless patching instead of refactoring, partial migrations, hallucinated APIs, weak tests/mocking, and compounding tech debt.
  • Performance: models default to naive algorithms unless explicitly prompted to research and compare faster approaches.
  • Visual and spatial tasks (SVG shapes, images) remain weak; proprietary or niche frameworks fare worse than mainstream stacks.
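The “weak tests/mocking” failure mode can be sketched concretely. The names here (`total_price`, `fetch_price`) are hypothetical: a mock that always returns the same value exercises only the arithmetic, while a fake with varied data plus an invariant constrains far more of the behavior.

```python
from unittest import mock

def total_price(cart, tax_rate, fetch_price):
    """Sums item prices from a price lookup and applies tax."""
    subtotal = sum(fetch_price(item) for item in cart)
    return round(subtotal * (1 + tax_rate), 2)

# Weak test: the mock returns one fixed price, so a bug in how prices
# are fetched or summed for varied items would never be caught.
fetch = mock.Mock(return_value=10.0)
assert total_price(["a", "b"], 0.1, fetch) == 22.0

# Stronger test: a fake with realistic, varied data plus an invariant.
prices = {"a": 3.0, "b": 7.5, "c": 0.0}
assert total_price(["a", "b", "c"], 0.0, prices.get) == 10.5
assert total_price([], 0.2, prices.get) == 0.0   # empty-cart invariant
```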

Productivity, Skills, and Workforce Impact

  • Enthusiasts report 4–10x productivity, claiming LLM code can match or exceed typical enterprise quality when guided well.
  • Skeptics counter that review, debugging, and architecture still dominate effort, so net gains are modest for hard problems.
  • Debate over juniors: some see LLMs as accelerants; others worry trainees won’t develop deep understanding if they only “steer” agents.

Agents, Tools, and Future Directions

  • Distinction made between raw LLMs, chatbots, and full agents with tools, memory, and code execution.
  • Coding agents that can run benchmarks, tests, and refactors are seen as key to closing the gap between plausible and truly correct code.
  • Some expect big gains from reinforcement-style finetuning in verifiable domains like code and math; others stress that hard‑won, battle‑tested designs still require time and human judgment.
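A generate-verify-retry loop of the kind these agents run can be sketched without any model at all. In this stand-in, each hard-coded “candidate” plays the role of one model generation, and the test suite’s exit code is the feedback signal; the structure, not the candidates, is the point.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str) -> bool:
    """Execute a candidate implementation plus its tests in a subprocess;
    pass/fail becomes the feedback signal for the next iteration."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def agent_loop(candidates, tests):
    """Try each generation in turn until one passes the tests."""
    for attempt, code in enumerate(candidates, 1):
        if run_candidate(code, tests):
            return attempt, code
    return None, None

tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
buggy = "def add(a, b):\n    return a - b"   # first generation: wrong
fixed = "def add(a, b):\n    return a + b"   # second generation: correct
attempt, code = agent_loop([buggy, fixed], tests)
assert attempt == 2
```

Closing the plausible-vs-correct gap then reduces to making `tests` strong enough (benchmarks, invariants) that passing it actually means correct.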