2026-02-05

We tasked Opus 4.6 using agent teams to build a C Compiler

Overall reaction

Many find the result astonishing: 16 agents, ~2,000 sessions, and ~$20k produced a ~100k‑LOC C compiler written in Rust that can build Linux 6.9 (x86, ARM, RISC‑V), QEMU, Postgres, SQLite, Doom, etc.
Others see it as more of a flashy demo than a practically useful compiler, and emphasize that this is an ideal, highly‑constrained problem.

Quality, correctness, and performance

Multiple commenters stress the compiler is slower than GCC even at -O0, sometimes fails on real code (e.g. “hello world” without extra include paths), and reportedly accepts type‑incorrect C.
It is described as brittle: it is unclear how it behaves on other kernel versions or broader codebases, and extending it often broke existing functionality.
Several argue that such an artifact is impressive but basically unusable in production; many say they would rather rewrite than maintain “LLM slop.”

Training data and “clean‑room” controversy

Strong disagreement over calling it “clean‑room”:
- Critics: the model was trained on GCC/Clang and many compilers; using GCC as a correctness oracle and matching its behavior disqualifies it as clean‑room in the legal/engineering sense.
- Defenders: the output is not a verbatim copy, is in Rust, and uses general compiler knowledge rather than regurgitated code.
Linked research on near‑verbatim book extraction fuels concerns that models can memorize significant training data.

Role of tests, oracles, and problem choice

Many note this is the best‑case AI coding task: a well‑specified language, mature specs, enormous test suites, and an existing compiler as oracle.
The GCC‑oracle harness and heavy test‑driven iteration are viewed as the real enablers; without this, the system got stuck or regressed.
Some generalize: for any domain with strong automated tests, agentic coding can “fit” an implementation to the tests, akin to model training.
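
The oracle idea discussed above is essentially differential testing: run the same inputs through a trusted reference and through the candidate, and flag any disagreement. The thread does not describe the project's actual harness, so the following is only a minimal sketch under stated assumptions. The function names, the `(exit_code, stdout)` comparison, and the `./ccc` candidate path in the usage comment are all hypothetical.

```python
import subprocess
from pathlib import Path


def compile_and_run(compiler, source, workdir, timeout=10):
    """Compile `source` with `compiler`, run the binary, return (status, output).

    Hypothetical helper: assumes the compiler accepts `<src> -o <exe>`.
    """
    exe = Path(workdir) / (Path(compiler).name + ".out")
    try:
        build = subprocess.run(
            [compiler, str(source), "-o", str(exe)],
            capture_output=True, text=True, timeout=timeout,
        )
        if build.returncode != 0:
            return ("compile_error", build.stderr)
        run = subprocess.run(
            [str(exe)], capture_output=True, text=True, timeout=timeout,
        )
        return (run.returncode, run.stdout)
    except subprocess.TimeoutExpired:
        return ("timeout", "")


def differential_test(sources, reference, candidate):
    """Return (source, expected, actual) for every input where the
    candidate's observable behavior differs from the reference's."""
    failures = []
    for src in sources:
        expected = reference(src)   # oracle result, e.g. from GCC
        actual = candidate(src)     # result from the compiler under test
        if expected != actual:
            failures.append((src, expected, actual))
    return failures


# Possible wiring (assumes `gcc` is on PATH and a hypothetical `./ccc` binary):
#   import tempfile
#   with tempfile.TemporaryDirectory() as tmp:
#       ref = lambda s: compile_and_run("gcc", s, tmp)
#       cand = lambda s: compile_and_run("./ccc", s, tmp)
#       print(differential_test(Path("tests").glob("*.c"), ref, cand))
```

Passing the reference and candidate as callables keeps the comparison loop independent of any particular compiler, which is also what makes the "fit to the tests" analogy apt: the failure list acts as the error signal that drives the next iteration.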

Cost, productivity, and employment

Debate over whether $20k for this result is cheap or expensive compared to humans (e.g., “one good dev in a few months” vs “no one builds this in two weeks”).
Thread reflects wider AI backlash: some see this as marketing aimed at justifying layoffs; others see it as a clear signal that a large fraction of software work is at risk, though not the hardest 1–10%.
