We tasked Opus 4.6 using agent teams to build a C Compiler

Overall reaction

  • Many find the result astonishing: 16 agents, ~2,000 sessions, and ~$20k produced a ~100k‑LOC C compiler, written in Rust, that can build Linux 6.9 (x86, ARM, RISC‑V), QEMU, Postgres, SQLite, Doom, etc.
  • Others see it as more of a flashy demo than a practically useful compiler, and emphasize that this is an ideal, highly‑constrained problem.

Quality, correctness, and performance

  • Multiple commenters stress that the compiler produces slower code than GCC even at -O0, sometimes fails on real code (e.g. “hello world” without extra include paths), and reportedly accepts type‑incorrect C.
  • It is described as brittle: it is unclear how it behaves on other kernel versions or broader codebases, and extending it often broke existing functionality.
  • Several argue that such an artifact is impressive but basically unusable in production; many say they would rather rewrite than maintain “LLM slop.”

Training data and “clean‑room” controversy

  • Strong disagreement over calling it “clean‑room”:
    • Critics: the model was trained on GCC/Clang and many compilers; using GCC as a correctness oracle and matching its behavior disqualifies it as clean‑room in the legal/engineering sense.
    • Defenders: the output is not a verbatim copy, is in Rust, and uses general compiler knowledge rather than regurgitated code.
  • Linked research on near‑verbatim book extraction fuels concerns that models can memorize significant training data.

Role of tests, oracles, and problem choice

  • Many note this is the best‑case AI coding task: a well‑specified language, mature specs, enormous test suites, and an existing compiler as oracle.
  • The GCC‑oracle harness and heavy test‑driven iteration are viewed as the real enablers; without them, the system got stuck or regressed.
  • Some generalize: for any domain with strong automated tests, agentic coding can “fit” an implementation to the tests, akin to model training.
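
The oracle idea in the bullets above can be illustrated with a minimal differential-testing sketch in Python. It is not the harness used in the project: the names `make_runner` and `differential_test`, the `(exit code, stdout)` comparison, and the `./my_cc` path in the usage note are all assumptions made for illustration. Each test program is built and run with a reference compiler and with the compiler under test, and any disagreement is reported.

```python
import os
import subprocess
import tempfile
from typing import Callable, Iterable, List, Optional, Tuple

# (exit code, stdout) observed when running a compiled program.
# An exit code of None marks a compile failure (a convention of this sketch).
Outcome = Tuple[Optional[int], str]
# Hypothetical interface: takes C source text, returns the observed outcome.
Runner = Callable[[str], Outcome]


def make_runner(compiler_cmd: List[str], timeout: float = 10.0) -> Runner:
    """Build a Runner that compiles with `compiler_cmd`, then executes the binary.

    Assumes the compiler accepts `<file.c> -o <output>` like GCC does.
    Timeouts are not caught here; subprocess.TimeoutExpired propagates.
    """
    def run(source: str) -> Outcome:
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "case.c")
            exe = os.path.join(tmp, "case.out")
            with open(src, "w") as f:
                f.write(source)
            build = subprocess.run(
                compiler_cmd + [src, "-o", exe],
                capture_output=True, text=True, timeout=timeout,
            )
            if build.returncode != 0:
                return (None, "")
            done = subprocess.run(
                [exe], capture_output=True, text=True, timeout=timeout,
            )
            return (done.returncode, done.stdout)
    return run


def differential_test(
    cases: Iterable[Tuple[str, str]], oracle: Runner, candidate: Runner
) -> List[str]:
    """Return the names of cases where the candidate disagrees with the oracle."""
    failures = []
    for name, source in cases:
        if oracle(source) != candidate(source):
            failures.append(name)
    return failures
```

A call such as `differential_test(cases, make_runner(["gcc"]), make_runner(["./my_cc"]))` would then list the programs on which the two compilers diverge, where `./my_cc` stands in for the compiler under test. Because the runners are plain callables, the comparison logic can be exercised with stubs and without any compiler installed.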

Cost, productivity, and employment

  • Debate over whether $20k for this result is cheap or expensive compared to humans (e.g., “one good dev in a few months” vs “no one builds this in two weeks”).
  • The thread reflects a wider AI backlash: some see this as marketing aimed at justifying layoffs; others see it as a clear signal that a large fraction of software work is at risk, though not the hardest 1–10%.