Scaling long-running autonomous coding

Use of libraries and “from scratch” claims

  • Commenters note substantial reliance on existing libraries (e.g., HTML parsing and CSS layout crates), questioning the “from scratch” framing (see the parsing sketch after this list).
  • Some say this doesn’t materially diminish the achievement as a demo of what agents can do; others see it as the strongest argument that the result is more “glue + wrappers” than a new engine.
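
To make the “glue + wrappers” point concrete: a few lines over an existing parsing crate already yield a walkable DOM. This is a minimal sketch assuming the html5ever and markup5ever_rcdom crates as examples; the project’s actual dependency list isn’t quoted in the thread:

    // Cargo deps (assumed for this sketch): html5ever, markup5ever_rcdom
    use html5ever::parse_document;
    use html5ever::tendril::TendrilSink;
    use markup5ever_rcdom::{Handle, NodeData, RcDom};

    fn main() {
        let html = "<html><body><p>hello</p></body></html>";
        // The crate handles tokenization, tree construction, and error
        // recovery per the HTML spec; the "engine" just consumes the DOM.
        let dom = parse_document(RcDom::default(), Default::default())
            .from_utf8()
            .read_from(&mut html.as_bytes())
            .expect("parse failed");
        walk(&dom.document, 0);
    }

    // Print the element tree, indented by depth.
    fn walk(node: &Handle, depth: usize) {
        if let NodeData::Element { ref name, .. } = node.data {
            println!("{}<{}>", "  ".repeat(depth), name.local);
        }
        for child in node.children.borrow().iter() {
            walk(child, depth + 1);
        }
    }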

Correctness, testing, and verification gaps

  • Multiple people highlight that rendering something is easy; making it fast, correct, and secure is the hard part.
  • There’s frustration that the experiment write-up says little about systematic testing: running web-platform-tests, fuzzing with randomly generated pages, crash feedback loops, etc. (a minimal fuzz harness is sketched after this list).
  • Several predict that as code generation gets cheaper, most effort will shift to specification and automated verification.
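
A crash-feedback fuzz loop of the kind commenters wanted to see can be quite small. This sketch assumes a hypothetical ./my_engine binary with a --headless flag; a real harness would add timeouts, input minimization, and grammar-aware mutation:

    use std::fs;
    use std::process::Command;

    // Trivial page generator driven by a seeded LCG; a real harness would
    // mutate grammar-derived inputs rather than concatenate random tags.
    fn random_page(seed: u64) -> String {
        let tags = ["div", "table", "span", "iframe", "svg"];
        let mut page = String::from("<html><body>");
        let mut x = seed;
        for _ in 0..32 {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1);
            let t = tags[(x >> 33) as usize % tags.len()];
            page.push_str(&format!("<{t} style=\"width:{}px\">x</{t}>", x % 10_000));
        }
        page.push_str("</body></html>");
        page
    }

    fn main() -> std::io::Result<()> {
        fs::create_dir_all("crashes")?;
        for seed in 0..1_000u64 {
            let page = random_page(seed);
            fs::write("case.html", &page)?;
            // Render headlessly; a non-zero or killed exit signals a crash.
            // (Binary name and flag are placeholders for this sketch.)
            let status = Command::new("./my_engine")
                .args(["--headless", "case.html"])
                .status()?;
            if !status.success() {
                // Feed the crasher back: save it so the agent (or a human)
                // can minimize it and turn it into a regression test.
                fs::write(format!("crashes/seed_{seed}.html"), &page)?;
            }
        }
        Ok(())
    }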

Autonomy vs human-guided architecture

  • A recurring theme: autonomous agents can write lots of code, but produce incoherent, conceptually weak architectures.
  • A browser engineer dissects subsystems (e.g., IndexedDB) and argues the design can’t evolve into a real multi-process engine, citing shared Arc<Mutex<...>> state and rendering loops that diverge from what the web standards specify.
  • Proposed alternative: humans define architecture and constraints, agents handle implementation details within modular, human-reviewed loops, more like a traditional open-source project (the sketch after this list contrasts the two styles).
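
A minimal contrast of the two styles, with hypothetical type and message names (not the project’s actual code): the shared-lock pattern bakes a shared-memory assumption into every subsystem, while a channel boundary between subsystems can later become an IPC/process boundary:

    use std::collections::HashMap;
    use std::sync::mpsc;
    use std::sync::{Arc, Mutex};
    use std::thread;

    // Hypothetical stand-in for the engine-wide struct commenters describe.
    struct EngineState {
        indexed_db: Vec<u8>, // placeholder for storage state
        frame_count: u64,
    }

    // The criticized pattern: every subsystem clones one big shared handle,
    // so renderer and storage contend on a single lock, and nothing can be
    // moved into a separate process because shared memory is assumed.
    fn shared_state_style() {
        let state = Arc::new(Mutex::new(EngineState { indexed_db: Vec::new(), frame_count: 0 }));
        let s = Arc::clone(&state);
        let renderer = thread::spawn(move || {
            let mut st = s.lock().unwrap();
            st.frame_count += 1;
            st.indexed_db.push(0);
        });
        renderer.join().unwrap();
    }

    // The alternative: subsystems own their state and talk over channels.
    enum StorageMsg {
        Put { key: String, value: Vec<u8> },
    }

    fn message_passing_style() {
        let (tx, rx) = mpsc::channel::<StorageMsg>();
        let storage = thread::spawn(move || {
            let mut store: HashMap<String, Vec<u8>> = HashMap::new();
            for msg in rx {
                match msg {
                    StorageMsg::Put { key, value } => {
                        store.insert(key, value);
                    }
                }
            }
            store.len() // entries received before shutdown
        });
        tx.send(StorageMsg::Put { key: "k".into(), value: vec![1, 2, 3] }).unwrap();
        drop(tx); // closing the channel lets the storage thread exit cleanly
        let _entries = storage.join().unwrap();
    }

    fn main() {
        shared_state_style();
        message_passing_style();
    }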

Maintainability and lifecycle concerns

  • Several report AI-generated repos full of duplication and brittle quick fixes; maintainability is an open question, especially beyond a few months.
  • Some speculate that future, better models might “clean up” older slop; others counter that current outputs are essentially throwaway.
  • Questions are raised about how “autonomous” the week-long run really was and what human interventions occurred.

Browser as benchmark vs real-world relevance

  • One side: browsers are among the most complex software systems; even partial success is a strong capability signal.
  • Other side: this is an unusually favorable domain—clear specs, exhaustive tests, reference implementations, decomposable components, and models already trained on many browsers. Most real-world problems lack these properties.

Costs, impact, and philosophy

  • Token usage in the trillions (implying multi-million-dollar spend; a back-of-envelope conversion follows this list) divides opinion: some see it as cheaper than an engineering team, others say the resulting code isn’t worth even cents.
  • Environmental and system-level cost comparisons (humans vs GPUs, datacenters, food, education) are deemed extremely complex.
  • Long subthreads debate whether LLMs are “just remixers/statistical parrots” or a nascent form of intelligence; there’s no consensus, but several stress that usefulness doesn’t require “true” understanding.
  • Many conclude that tests, specs, and project context (docs, embedded standards) are the real long-term assets; raw code is increasingly commodity.
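
For scale, the conversion from tokens to dollars is simple arithmetic; the blended per-token rate below is an assumed illustrative figure, not one reported in the write-up:

    \[
      2 \times 10^{12}\ \text{tokens} \times \frac{\$3}{10^{6}\ \text{tokens}} = \$6 \times 10^{6}
    \]

At rates of a few dollars per million tokens, any run in the low trillions of tokens lands in the single-digit millions, which is the range commenters argue over.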