Agents that run while I sleep

Test Freezing & File Permissions

  • Many want a way to “lock” tests so agents can’t modify them while iterating on code.
  • Proposed mechanisms: devcontainers with read-only mounts, filesystem permissions on test dirs, CLI permission toggles, pre-tool hooks that block read/write to specific paths, and hashing or commit hooks to detect tampering.
  • Some argue a strong instruction (“don’t touch tests”) is usually enough; others don’t trust advisory prompts and want hard guarantees.

Test Quality, TDD & “Test Theater”

  • Strong support for test-first or test-driven workflows, but disagreement on what real TDD is (small red–green–refactor steps vs “write all tests then all code”).
  • Concern that LLM-generated tests often: confirm current behavior instead of requirements, overfit implementation details, include placeholders that always pass, or only test setup.
  • “Test theater”: high coverage numbers from meaningless tests, which teaches teams to ignore failing tests and, when they do act, to fix the tests rather than the behavior.
  • Suggested mitigations: outside‑in TDD, acceptance/behavioral tests over unit internals, property-based testing, mutation testing, external conformance suites, and “learning tests” to understand new components.
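
Of those mitigations, property-based testing is the easiest to sketch without extra tooling: instead of asserting on hand-picked examples (which a model can overfit), assert an invariant over many generated inputs. This is a minimal hand-rolled version; in practice a library such as Hypothesis handles generation and shrinking.

```python
import random

def check_property(prop, gen, trials=200, seed=0):
    """Run prop against many generated inputs; return the first
    counterexample found, or None if the property held every time."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        if not prop(x):
            return x
    return None

def sort_properties(xs):
    """Invariants of sorting: idempotent, length-preserving."""
    s = sorted(xs)
    return sorted(s) == s and len(s) == len(xs)

def gen_list(rng):
    """Random integer lists of varying length, including the empty list."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
```

A property like this is much harder to satisfy with a placeholder test that always passes, which is exactly the failure mode described above.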

Multi‑Agent & Adversarial Patterns

  • Many experiment with separate agents for: implementation (green), test writing (red), refactoring, and QA/judging.
  • Separation of context and permissions is seen as crucial so code agents can’t read or edit tests directly, reducing self‑grading and reward/specification gaming.
  • Some use different models to cross‑review each other; others say model diversity matters less than isolating context and wiring a good pipeline (plan → review → implement → review → fix → CI).
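
One way to make the context/permission split concrete is to model each pipeline stage with its own readable and writable path sets, so the code agent cannot even see the tests. The stage names and the `Stage` structure here are purely illustrative; real harnesses wire this into tool hooks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    readable: frozenset  # paths this agent may read
    writable: frozenset  # paths this agent may write

# Red writes tests but never sees src; green writes src but never sees tests;
# the judge reads everything and writes nothing.
PIPELINE = [
    Stage("red",   frozenset({"spec/"}),                   frozenset({"tests/"})),
    Stage("green", frozenset({"spec/", "src/"}),           frozenset({"src/"})),
    Stage("judge", frozenset({"spec/", "src/", "tests/"}), frozenset()),
]

def allowed(stage: Stage, path: str, write: bool) -> bool:
    """Check a proposed file access against the stage's permission sets."""
    roots = stage.writable if write else stage.readable | stage.writable
    return any(path.startswith(r) for r in roots)
```

The point of the structure is that self-grading becomes impossible by construction: `allowed(green, "tests/...", write=False)` is false, so the implementation agent cannot tailor code to the tests it is graded against.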

Code Review, Slop & Human Bottlenecks

  • Core anxiety: agents can generate far more code than humans can meaningfully review; people report 20k‑line branches from long‑running agents.
  • Suggestions: enforce small PRs, cap concurrent work, treat agent output like compiler output and only review at higher-level specs, or use agents to prioritize risky areas and generate checklists.
  • Many argue that if you’re not reading or testing what ships, you’ve just moved chaos up a level; some see this as irresponsible beyond toy projects.
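
A small-PR cap is one of the few suggestions above that is trivial to enforce mechanically. A sketch assuming a git-based flow; the 400-line threshold and the helper names are made up for illustration:

```python
import subprocess

MAX_CHANGED_LINES = 400  # hypothetical threshold; tune per team

def count_changed(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.
    Binary files report "-" for both counts and are skipped."""
    total = 0
    for line in numstat.splitlines():
        added, deleted, _path = line.split("\t")
        if added != "-":
            total += int(added) + int(deleted)
    return total

def pr_too_large(base: str = "main") -> bool:
    """Gate a branch: True if it changes more lines than the cap allows."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_changed(out) > MAX_CHANGED_LINES
```

Run as a CI step, this forces a long-running agent to ship many reviewable slices instead of one 20k-line branch.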

Cost, Productivity & Practical Use

  • Reports of substantial token spend (hundreds of dollars in days) with long‑running or nested agents; others get good mileage from a simple setup (one coding agent + one review agent) and short sessions.
  • Strong skepticism toward claims of 5–10× productivity and “50 PRs a week”; many note coding was never the main bottleneck compared to spec, design, and review.
  • Some treat agents like junior devs who need guardrails: they speed up boilerplate and test writing, but their output still requires full human verification.

Reliability, Risk & Guardrails

  • Emphasis that agents do exactly what they are allowed to do, including destructive actions (e.g., Terraform destroy).
  • Recommended safeguards: sandboxed VMs, read‑only mounts for sensitive assets, strict tooling hooks, and explicit escalation rules for autonomous agents.
  • Several argue that for high‑risk or mission‑critical systems, human review and stronger verification (formal methods, end‑to‑end testing) remain non‑negotiable.
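
A pre-tool hook of the kind mentioned above can be as simple as a deny-list matched against each proposed shell command. The patterns and the `gate` function are hypothetical; real agent frameworks each expose their own hook interface.

```python
import re

# Hypothetical deny-list for destructive commands an autonomous
# agent should never run without human escalation.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bgit\s+push\s+--force\b",
]

def gate(command: str) -> bool:
    """Return True if the command may run; matches against the
    deny-list are blocked and should be escalated to a human."""
    return not any(re.search(p, command) for p in DENY_PATTERNS)
```

Deny-lists are inherently incomplete, which is why the thread pairs them with sandboxed VMs and read-only mounts: the hook catches the obvious disasters, the sandbox bounds the rest.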

Broader Reflections on the Profession

  • Some fear a drift toward accepting unreliable, barely‑understood software because it’s cheap and fast to generate.
  • Others think many domains can tolerate higher defect rates and that roles will shift toward spec writing, verification, and adversarial QA of AI output rather than hand‑coding everything.
  • Overall sentiment: LLMs are powerful tools, but not substitutes for clear specs, thoughtful design, and human responsibility.