Agents that run while I sleep

Test Freezing & File Permissions

  • Many want a way to “lock” tests so agents can’t modify them while iterating on code.
  • Proposed mechanisms: devcontainers with read-only mounts, filesystem permissions on test dirs, CLI permission toggles, pre-tool hooks that block read/write to specific paths, and hashing or commit hooks to detect tampering.
  • Some argue a strong instruction (“don’t touch tests”) is usually enough; others don’t trust advisory prompts and want hard guarantees.

Test Quality, TDD & “Test Theater”

  • Strong support for test-first or test-driven workflows, but disagreement on what real TDD is (small red–green–refactor steps vs “write all tests then all code”).
  • Concern that LLM-generated tests often: confirm current behavior instead of requirements, overfit implementation details, include placeholders that always pass, or only test setup.
  • “Test theater”: high coverage numbers from meaningless tests, which teaches teams to ignore failing tests and, when they do act, to fix the tests rather than the behavior.
  • Suggested mitigations: outside‑in TDD, acceptance/behavioral tests over unit internals, property-based testing, mutation testing, external conformance suites, and “learning tests” to understand new components.
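
Of those mitigations, property-based testing is the easiest to sketch without extra tooling: instead of asserting on hand-picked examples (which a model can overfit), assert an invariant over many generated inputs. This is a minimal hand-rolled version; in practice a library such as Hypothesis handles generation and shrinking.

```python
import random

def check_property(prop, gen, trials=200, seed=0):
    """Run prop against many generated inputs; return the first
    counterexample found, or None if the property held every time."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        if not prop(x):
            return x
    return None

def sort_properties(xs):
    """Invariants of sorting: idempotent, length-preserving."""
    s = sorted(xs)
    return sorted(s) == s and len(s) == len(xs)

def gen_list(rng):
    """Random integer lists of varying length, including the empty list."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
```

A property like this is much harder to satisfy with a placeholder test that always passes, which is exactly the failure mode described above.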

Multi‑Agent & Adversarial Patterns

  • Many experiment with separate agents for: implementation (green), test writing (red), refactoring, and QA/judging.
  • Separation of context and permissions is seen as crucial so code agents can’t read or edit tests directly, reducing self‑grading and reward/specification gaming.
  • Some use different models to cross‑review each other; others say model diversity matters less than isolating context and wiring a good pipeline (plan → review → implement → review → fix → CI).
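
One way to make the context/permission split concrete is to model each pipeline stage with its own readable and writable path sets, so the code agent cannot even see the tests. The stage names and the `Stage` structure here are purely illustrative; real harnesses wire this into tool hooks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    readable: frozenset  # paths this agent may read
    writable: frozenset  # paths this agent may write

# Red writes tests but never sees src; green writes src but never sees tests;
# the judge reads everything and writes nothing.
PIPELINE = [
    Stage("red",   frozenset({"spec/"}),                   frozenset({"tests/"})),
    Stage("green", frozenset({"spec/", "src/"}),           frozenset({"src/"})),
    Stage("judge", frozenset({"spec/", "src/", "tests/"}), frozenset()),
]

def allowed(stage: Stage, path: str, write: bool) -> bool:
    """Check a proposed file access against the stage's permission sets."""
    roots = stage.writable if write else stage.readable | stage.writable
    return any(path.startswith(r) for r in roots)
```

The point of the structure is that self-grading becomes impossible by construction: `allowed(green, "tests/...", write=False)` is false, so the implementation agent cannot tailor code to the tests it is graded against.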

Code Review, Slop & Human Bottlenecks

  • Core anxiety: agents can generate far more code than humans can meaningfully review; people report 20k‑line branches from long‑running agents.
  • Suggestions: enforce small PRs, cap concurrent work, treat agent output like compiler output and only review at higher-level specs, or use agents to prioritize risky areas and generate checklists.
  • Many argue that if you’re not reading or testing what ships, you’ve just moved chaos up a level; some see this as irresponsible beyond toy projects.
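
A small-PR cap is one of the few suggestions above that is trivial to enforce mechanically. A sketch assuming a git-based flow; the 400-line threshold and the helper names are made up for illustration:

```python
import subprocess

MAX_CHANGED_LINES = 400  # hypothetical threshold; tune per team

def count_changed(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.
    Binary files report "-" for both counts and are skipped."""
    total = 0
    for line in numstat.splitlines():
        added, deleted, _path = line.split("\t")
        if added != "-":
            total += int(added) + int(deleted)
    return total

def pr_too_large(base: str = "main") -> bool:
    """Gate a branch: True if it changes more lines than the cap allows."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_changed(out) > MAX_CHANGED_LINES
```

Run as a CI step, this forces a long-running agent to ship many reviewable slices instead of one 20k-line branch.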

Cost, Productivity & Practical Use

  • Reports of substantial token spend (hundreds of dollars in days) with long‑running or nested agents; others get good mileage from a simple setup (one coding agent + one review agent) and short sessions.
  • Strong skepticism toward claims of 5–10× productivity and “50 PRs a week”; many note coding was never the main bottleneck compared to spec, design, and review.
  • Some treat agents like junior devs who need guardrails: they speed up boilerplate and test writing, but their output still requires full human verification.

Reliability, Risk & Guardrails

  • Emphasis that agents do exactly what they are allowed to do, including destructive actions (e.g., Terraform destroy).
  • Recommended safeguards: sandboxed VMs, read‑only mounts for sensitive assets, strict tooling hooks, and explicit escalation rules for autonomous agents.
  • Several argue that for high‑risk or mission‑critical systems, human review and stronger verification (formal methods, end‑to‑end testing) remain non‑negotiable.
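
A pre-tool hook of the kind mentioned above can be as simple as a deny-list matched against each proposed shell command. The patterns and the `gate` function are hypothetical; real agent frameworks each expose their own hook interface.

```python
import re

# Hypothetical deny-list for destructive commands an autonomous
# agent should never run without human escalation.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bgit\s+push\s+--force\b",
]

def gate(command: str) -> bool:
    """Return True if the command may run; matches against the
    deny-list are blocked and should be escalated to a human."""
    return not any(re.search(p, command) for p in DENY_PATTERNS)
```

Deny-lists are inherently incomplete, which is why the thread pairs them with sandboxed VMs and read-only mounts: the hook catches the obvious disasters, the sandbox bounds the rest.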

Broader Reflections on the Profession

  • Some fear a drift toward accepting unreliable, barely‑understood software because it’s cheap and fast to generate.
  • Others think many domains can tolerate higher defect rates and that roles will shift toward spec writing, verification, and adversarial QA of AI output rather than hand‑coding everything.
  • Overall sentiment: LLMs are powerful tools, but not substitutes for clear specs, thoughtful design, and human responsibility.