Tao on “blue team” vs. “red team” LLMs

Role of LLMs in Testing and Code Generation

  • Strong disagreement on whether it’s “safe” to let LLMs generate lots of tests.
    • Pro side: tests are cheap, easy to delete, and LLMs often suggest extra edge cases humans skip. Some teams allow AI-generated tests but require human review and keep them separate from “expert” tests.
    • Con side: in large/legacy codebases, tests are the de facto source of truth, and wrong tests are worse than wrong code. Brittle or low-quality tests become “change detectors” that fail on harmless refactors, slow development, and create ambiguity about whether a failure is a real bug or a bad test.
  • Long subthread on TDD, what counts as a “unit,” and how tightly tests should couple to implementation details.
  • Fuzzing is discussed as an alternative/adjacent strategy: good at surfacing unexpected state-machine, memory, and parsing bugs, but can lead to piles of opaque regression tests if not curated.
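The curated-fuzzing point above can be made concrete with a minimal harness. Everything here is hypothetical: `parse_csv_row` stands in for the code under test, and the invariant (“the only acceptable failure is a clean `ValueError`”) is an assumed contract, not anything from the thread.

```python
import random
import string

def parse_csv_row(line: str) -> list[str]:
    # Hypothetical parser standing in for the code under test.
    if "\x00" in line:
        raise ValueError("NUL byte not allowed")
    return line.split(",")

def fuzz(rounds: int = 1000, seed: int = 0) -> int:
    """Throw random strings at the parser and enforce one explicit
    invariant: it may reject input with ValueError, but must never
    crash with anything else. Returns how many inputs were accepted."""
    rng = random.Random(seed)
    alphabet = string.printable + "\x00"
    accepted = 0
    for _ in range(rounds):
        line = "".join(rng.choice(alphabet)
                       for _ in range(rng.randrange(0, 40)))
        try:
            parse_csv_row(line)
            accepted += 1
        except ValueError:
            pass  # documented, expected rejection
        # Any other exception propagates and fails the fuzz run.
    return accepted
```

Keeping the invariant explicit, rather than snapshotting each crashing input as its own regression test, is one way to avoid the “pile of opaque regression tests” failure mode mentioned above.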

Are Tests the Spec? What Is the Source of Truth?

  • One camp: tests are the specification for humans and machines; extra natural-language specs just add drift and busywork.
  • Opposing camp: tests, code, docs, and people’s memories are four imperfect caches of an underlying intent; none is a single source of truth. Tests are at best an approximation of the spec and can never fully cover complex input spaces.
  • Several comments stress documenting why a test exists and which behavior it protects, not just what it checks.
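One common way to capture that “why” is in the test’s docstring. A sketch, with a hypothetical `normalize_username` function and a made-up rationale for illustration:

```python
def normalize_username(name: str) -> str:
    # Hypothetical function under test.
    return name.strip().lower()

def test_normalization_is_idempotent():
    """WHY THIS TEST EXISTS: normalization runs twice in the login
    path (form handler, then DB layer). If it is not idempotent,
    the same person can end up with two accounts. The behavior that
    matters is idempotence; the exact strip/lowercase implementation
    is free to change without this test failing."""
    for raw in ["  Alice ", "BOB", "carol"]:
        once = normalize_username(raw)
        assert normalize_username(once) == once
```

A test written this way tells a future maintainer (human or LLM) whether a failure means a real bug or an obsolete assumption.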

Red vs Blue Team Analogy for LLMs

  • Many agree LLMs are more trustworthy as “red team” tools: critics, reviewers, fuzzers, security probes, log analyzers, and adversarial test generators—especially where there’s a clear oracle or verifier.
  • Others report success with agentic workflows where LLMs do both: implement features and then aggressively test and attack their own work.
  • Some argue the real pattern today is the opposite: LLMs rapidly draft (blue), humans review (red), particularly because humans are better at subtle, global judgment than at spotting every local bug.
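The “clear oracle or verifier” condition mentioned above is easiest to see in differential testing: an adversarial generator (LLM or random) only needs a trusted reference to check candidates against. A minimal sketch; `fast_sort` is a hypothetical implementation under attack, with the stdlib `sorted` as the oracle:

```python
import random

def fast_sort(xs: list[int]) -> list[int]:
    # Hypothetical implementation under test; a real one might be
    # a hand-rolled radix sort or similar.
    return sorted(xs)

def red_team_check(trials: int = 500, seed: int = 0) -> bool:
    """Generate adversarial inputs and compare the implementation
    against a trusted oracle (the stdlib sort). Any mismatch is a
    confirmed bug, no human judgment required."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randrange(-50, 50)
              for _ in range(rng.randrange(0, 20))]
        assert fast_sort(xs) == sorted(xs), f"mismatch on {xs}"
    return True
```

This is why red-team roles suit LLMs: a wrong attack costs nothing, because the oracle filters it out.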

Security Analogies and Defense-in-Depth

  • Debate over “a system is only as strong as its weakest link”:
    • Some say this oversimplifies severity levels and defense-in-depth; layered security can mitigate single weak points.
    • Others respond that weakest links (e.g., password reset processes) are still common real-world entry points; defense-in-depth is a response to, not a refutation of, that fact.
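The two positions above differ in which number they care about. Under an independence assumption (which real attack chains often violate, since one phished credential can defeat several layers at once), the arithmetic looks like this; the per-layer probabilities are made up for illustration:

```python
import math

# Illustrative, made-up probabilities that an attacker defeats each layer.
layers = {"password reset": 0.05, "MFA": 0.02, "anomaly detection": 0.10}

# Weakest-link view: if only the weakest layer guards the door,
# the breach probability is that layer's alone.
weakest_alone = max(layers.values())          # 0.10

# Defense-in-depth view: the attacker must defeat every layer,
# assuming the layers fail independently.
all_layers = math.prod(layers.values())       # 0.05 * 0.02 * 0.10 = 0.0001
```

Both sides of the debate are visible here: layering cuts the combined probability by orders of magnitude, yet the weakest layer still dominates which path attackers try first.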

Broader Concerns and Meta Points

  • Worry that LLM-assisted testing amplifies bureaucratic, low-value work (mock-heavy, pointless tests) rather than true quality.
  • Observations that top practitioners become much stronger with LLMs, while weaker ones lean on them to produce “slop.”
  • Multiple comparisons to GANs, game-theoretic adversaries, chaos engineering, and editor–author workflows as precedents for red/blue-style setups.