LLMs work best when the user defines their acceptance criteria first
Role of Acceptance Criteria and Testing
- Many argue LLMs work best when acceptance criteria are explicit: performance targets, formats, invariants, and tests.
- Threads emphasize writing tests or benchmarks first (TDD-like) so the model can iterate against a measurable definition of “correct.”
- Some use automated “evals” and invariant tools to enforce constraints at every generation step, not just once.
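The test-first pattern above can be sketched minimally. Assume a hypothetical `slugify` function being developed with an LLM: the acceptance tests (exact format plus an invariant over arbitrary inputs) are written before the implementation, giving the model a measurable definition of "correct" to iterate against.

```python
import re

def slugify(title):
    # Hypothetical function under development; the model iterates on this
    # body until the acceptance tests below pass.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Acceptance criteria written first, TDD-style: the expected output format
# and an invariant are explicit, so "correct" is measurable, not vibes.
def test_format():
    assert slugify("Hello, World!") == "hello-world"

def test_invariant():
    # Invariant: output is lowercase alphanumeric runs joined by single dashes.
    for t in ["  A  B  ", "already-a-slug", "MiXeD CaSe"]:
        assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slugify(t))

test_format()
test_invariant()
```

An eval harness can rerun these checks after every generation step, rejecting any patch that regresses them.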
LLM Code Quality: Plausible vs Correct
- Broad agreement that LLMs produce “plausible” code: syntactically valid and often passing simple tests, but hiding subtle bugs or terrible performance.
- Case studies: a Rust SQLite clone that passes tests but is orders of magnitude slower; a fleur‑de‑lis drawing task where models flounder on novel procedural geometry; naive S3→Postgres imports that overlook efficient bulk-load mechanisms.
- Several note that humans also write plausible-but-buggy code; 100% correctness was never the real bar.
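The plausible/correct gap often shows up as asymptotic complexity rather than wrong answers. A toy illustration (not from the threads): both functions below pass the same unit test, but only the second survives a large-input benchmark.

```python
def contains_dup_naive(xs):
    # "Plausible" version a model might emit: correct, but O(n^2)
    # pairwise scans that collapse on large inputs.
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] == xs[j]:
                return True
    return False

def contains_dup_fast(xs):
    # Equivalent O(n) version using a set; same observable behavior.
    return len(set(xs)) < len(xs)

# A simple test cannot tell them apart; a benchmark on 10^5 elements can.
assert contains_dup_naive([1, 2, 3, 2]) is True
assert contains_dup_fast([1, 2, 3, 2]) is True
```

This is why several commenters insist on performance targets as part of the acceptance criteria, not just functional tests.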
Effective Usage Patterns and Workflows
- Best outcomes come from treating models like junior devs: specify constraints, architecture, and acceptance tests; make them plan first; decompose into small tasks.
- Planning modes, “don’t code yet, ask clarifying questions,” and top‑down architecture design are repeatedly recommended.
- Some prefer LLMs as autocomplete for small snippets; large agentic rewrites are seen as brittle and hard to review.
Failure Modes and Limitations
- Common issues: code bloat, endless patching instead of refactoring, partial migrations, hallucinated APIs, weak tests/mocking, and compounding tech debt.
- Performance: models default to naive algorithms unless explicitly prompted to research and compare faster approaches.
- Visual and spatial tasks (SVG shapes, images) remain weak; proprietary or niche frameworks fare worse than mainstream stacks.
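The "naive by default" performance failure mode can be shown with a classic example (an illustration, not a case from the threads): repeated string concatenation versus a single join when serializing rows.

```python
def build_csv_naive(rows):
    # Default a model often reaches for: repeated += concatenation,
    # which can degrade quadratically on large inputs.
    s = ""
    for r in rows:
        s += ",".join(map(str, r)) + "\n"
    return s

def build_csv_fast(rows):
    # The linear alternative a model typically produces only when
    # explicitly prompted to compare approaches.
    return "".join(",".join(map(str, r)) + "\n" for r in rows)

# Identical output; very different scaling behavior.
assert build_csv_naive([[1, 2], [3, 4]]) == build_csv_fast([[1, 2], [3, 4]])
```

Prompts like "list two faster alternatives and their trade-offs before coding" are the workaround commenters describe.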
Productivity, Skills, and Workforce Impact
- Enthusiasts report 4–10x productivity, claiming LLM code can match or exceed typical enterprise quality when guided well.
- Skeptics counter that review, debugging, and architecture still dominate effort, so net gains are modest for hard problems.
- Debate over juniors: some see LLMs as accelerants; others worry trainees won’t develop deep understanding if they only “steer” agents.
Agents, Tools, and Future Directions
- Distinction made between raw LLMs, chatbots, and full agents with tools, memory, and code execution.
- Coding agents that can run benchmarks, tests, and refactors are seen as key to closing the gap between plausible and truly correct code.
- Some expect big gains from reinforcement-style finetuning in verifiable domains like code and math; others stress that hard‑won, battle‑tested designs still require time and human judgment.