Agents that run while I sleep
Test Freezing & File Permissions
- Many want a way to “lock” tests so agents can’t modify them while iterating on code.
- Proposed mechanisms: devcontainers with read-only mounts, filesystem permissions on test dirs, CLI permission toggles, pre-tool hooks that block read/write to specific paths, and hashing or commit hooks to detect tampering.
- Some argue a strong instruction (“don’t touch tests”) is usually enough; others don’t trust advisory prompts and want hard guarantees.
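Of the hard-guarantee options, hashing is the easiest to retrofit: snapshot the test tree before the agent runs, then diff afterwards. A minimal stdlib-only sketch (the function names and the `*.py` glob are illustrative, not from any particular tool):

```python
import hashlib
import pathlib

def snapshot(test_dir):
    """Record a SHA-256 digest for every test file under test_dir."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(pathlib.Path(test_dir).rglob("*.py"))
    }

def tampered(baseline, test_dir):
    """Return paths added, removed, or modified since the baseline snapshot."""
    current = snapshot(test_dir)
    changed = {p for p in baseline if p in current and baseline[p] != current[p]}
    return sorted((set(baseline) ^ set(current)) | changed)
```

A pre-commit hook or CI step could call `tampered()` and fail the run if it returns anything, turning the advisory "don't touch tests" into a detectable violation.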
Test Quality, TDD & “Test Theater”
- Strong support for test-first or test-driven workflows, but disagreement on what real TDD is (small red–green–refactor steps vs “write all tests then all code”).
- Concern that LLM-generated tests often: confirm current behavior instead of requirements, overfit implementation details, include placeholders that always pass, or only test setup.
- “Test theater”: high coverage numbers from meaningless tests, which teaches teams to ignore failing tests, or to “fix” the tests rather than the behavior.
- Suggested mitigations: outside‑in TDD, acceptance/behavioral tests over unit internals, property-based testing, mutation testing, external conformance suites, and “learning tests” to understand new components.
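Property-based testing is one of the suggested antidotes to tests that merely confirm current behavior: instead of asserting on hand-picked examples, you assert an invariant over many generated inputs. A stdlib-only sketch of the idea (real libraries such as Hypothesis add shrinking and smarter generation; all names here are illustrative):

```python
import random

def check_property(fn, prop, gen, trials=200, seed=0):
    """Call fn on many random inputs; return a counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = gen(rng)
        if not prop(xs, fn(xs)):
            return xs  # concrete failing input
    return None

# Invariant for any sort: the output equals the canonically sorted input.
def is_sorted_permutation(xs, ys):
    return ys == sorted(xs)

def random_ints(rng):
    """Generate short lists of small ints, likely to contain duplicates."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 10))]
```

A correct sort passes, while a "sort" that silently drops duplicates is caught with a concrete counterexample rather than a lucky hand-written case.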
Multi‑Agent & Adversarial Patterns
- Many experiment with separate agents for: implementation (green), test writing (red), refactoring, and QA/judging.
- Separation of context and permissions is seen as crucial so code agents can’t read or edit tests directly, reducing self‑grading and reward/specification gaming.
- Some use different models to cross‑review each other; others say model diversity matters less than isolating context and wiring a good pipeline (plan → review → implement → review → fix → CI).
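The plan → review → implement → review → fix → CI wiring mentioned above can be sketched as a plain orchestration loop. Every callable here is a hypothetical stand-in for a separate agent (or a CI invocation) with its own context and permissions; nothing below is a real agent API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentPipeline:
    plan: Callable[[str], str]           # planning agent
    review: Callable[[str], str]         # reviewing agent, separate context
    implement: Callable[[str], str]      # coding agent
    fix: Callable[[str, str], str]       # coding agent, given the CI failure log
    ci: Callable[[str], Optional[str]]   # returns a failure log, or None on pass
    max_fix_rounds: int = 3

    def run(self, task: str) -> str:
        spec = self.review(self.plan(task))       # plan -> review
        code = self.review(self.implement(spec))  # implement -> review
        for _ in range(self.max_fix_rounds):      # fix -> CI loop
            failure = self.ci(code)
            if failure is None:
                return code
            code = self.fix(code, failure)
        raise RuntimeError("CI still failing after fix rounds; escalate to a human")
```

The property the thread argues for is that `implement` and `review` run in separate contexts with separate permissions, so the coding agent never sees or edits its own grader.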
Code Review, Slop & Human Bottlenecks
- Core anxiety: agents can generate far more code than humans can meaningfully review; people report 20k‑line branches from long‑running agents.
- Suggestions: enforce small PRs, cap concurrent work, treat agent output like compiler output and review only at the spec level, or use agents to prioritize risky areas and generate review checklists.
- Many argue that if you’re not reading or testing what ships, you’ve just moved chaos up a level; some see this as irresponsible beyond toy projects.
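The "enforce small PRs" suggestion is mechanizable: a CI gate can parse `git diff --numstat` output and reject branches over a line budget. A sketch (the 400-line budget is an arbitrary assumption, not a number from the thread):

```python
def oversized(numstat_output, max_lines=400):
    """Parse `git diff --numstat` output; True if the diff exceeds the budget."""
    total = 0
    for line in numstat_output.splitlines():
        if not line:
            continue
        added, deleted, _path = line.split("\t", 2)
        # Binary files report "-" for both counts; treat them as zero lines.
        total += 0 if added == "-" else int(added)
        total += 0 if deleted == "-" else int(deleted)
    return total > max_lines
```

In CI this would be fed the output of something like `git diff --numstat main...HEAD`, failing the build before a 20k-line branch ever reaches a human reviewer.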
Cost, Productivity & Practical Use
- Reports of substantial token spend (hundreds of dollars in days) with long‑running or nested agents; others get good mileage from a simple setup (one coding agent + one review agent) and short sessions.
- Strong skepticism toward claims of 5–10× productivity and “50 PRs a week”; many note coding was never the main bottleneck compared to spec, design, and review.
- Some treat agents as junior devs needing guardrails; they speed up boilerplate and tests but still require full human verification.
Reliability, Risk & Guardrails
- Emphasis that agents do exactly what they are allowed to do, including destructive actions (e.g., Terraform destroy).
- Recommended safeguards: sandboxed VMs, read‑only mounts for sensitive assets, strict tooling hooks, and explicit escalation rules for autonomous agents.
- Several argue that for high‑risk or mission‑critical systems, human review and stronger verification (formal methods, end‑to‑end testing) remain non‑negotiable.
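A pre-tool hook along these lines is a small deny-list check run before each shell command the agent proposes. The patterns below are illustrative examples only, not a complete or vetted policy:

```python
import re
import shlex

# Hypothetical deny-list; each pattern matches a normalized command string.
DENY = [
    re.compile(r"^terraform\s+destroy\b"),
    re.compile(r"^rm\s+(-\S*\s+)*-\S*r"),      # any rm with a recursive flag
    re.compile(r"^git\s+push\s+.*--force"),
]

def allow_command(cmd: str) -> bool:
    """Return False if the proposed shell command matches a destructive pattern."""
    normalized = " ".join(shlex.split(cmd))    # collapse quoting and whitespace
    return not any(p.search(normalized) for p in DENY)
```

String matching is only advisory-plus: quoting tricks and wrapper scripts can evade a regex, which is why the thread pairs hooks with sandboxed VMs and read-only mounts as the actual hard boundary.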
Broader Reflections on the Profession
- Some fear a drift toward accepting unreliable, barely‑understood software because it’s cheap and fast to generate.
- Others think many domains can tolerate higher defect rates and that roles will shift toward spec writing, verification, and adversarial QA of AI output rather than hand‑coding everything.
- Overall sentiment: LLMs are powerful tools, but not substitutes for clear specs, thoughtful design, and human responsibility.