Scaling long-running autonomous coding
Use of libraries and “from scratch” claims
- Commenters note substantial reliance on existing libraries (e.g., HTML parsing and CSS layout crates), questioning the “from scratch” framing (see the sketch after this list).
- Some argue this doesn’t diminish the achievement as a demonstration of what agents can do; others see it as the strongest evidence that the result is more “glue + wrappers” than a new engine.
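To make the “glue + wrappers” reading concrete, here is a minimal sketch of how much a mature parser crate does on its own. It assumes the published html5ever and markup5ever_rcdom crates (version numbers illustrative); nothing here is taken from the project under discussion:

```rust
// Cargo.toml (illustrative versions):
//   html5ever = "0.26"
//   markup5ever_rcdom = "0.2"
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::{Handle, NodeData, RcDom};

/// Recursively print element names, indented by tree depth.
fn walk(node: &Handle, depth: usize) {
    if let NodeData::Element { ref name, .. } = node.data {
        println!("{}{}", "  ".repeat(depth), name.local);
    }
    for child in node.children.borrow().iter() {
        walk(child, depth + 1);
    }
}

fn main() {
    let html = "<html><body><p>hello <b>world</b></p></body></html>";
    // The parser crate does the spec-compliant heavy lifting;
    // the "engine" code here is a thin traversal around it.
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut html.as_bytes())
        .expect("reading from an in-memory buffer cannot fail");
    walk(&dom.document, 0);
}
```

Layout crates such as taffy play the analogous role for flexbox and grid, which is why commenters ask how much engine remains once the crates are subtracted.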
Correctness, testing, and verification gaps
- Multiple commenters note that rendering something is easy; making it fast, correct, and secure is the hard part.
- There’s frustration that the experiment write-up says little about systematic testing: running web-platform-tests, fuzzing with randomly generated pages, crash feedback loops, and so on (a minimal fuzz loop is sketched after this list).
- Several predict that as code generation gets cheaper, most effort will shift to specification and automated verification.
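As a sketch of the “fuzz with random pages, watch for crashes” loop commenters wanted to see, the following reuses the html5ever parser from the sketch above. A real setup would drive the full parse/style/layout pipeline under cargo-fuzz and triage the panics; the page generator here is deliberately crude:

```rust
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;

/// Tiny xorshift PRNG so the sketch has no extra dependencies.
struct XorShift(u64);
impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Build a random, probably malformed page: unclosed tags, stray brackets.
fn random_page(rng: &mut XorShift) -> String {
    let frags = ["<div>", "<p>", "</p", "<table><tr>", "<<b>", "</body>", "<x y='", "text"];
    let mut page = String::from("<html><body>");
    for _ in 0..(rng.next() % 64) {
        page.push_str(frags[(rng.next() as usize) % frags.len()]);
    }
    page
}

fn main() {
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    for i in 0..10_000 {
        let page = random_page(&mut rng);
        // catch_unwind turns a parser panic into a reportable failure
        // instead of killing the whole fuzz run.
        let result = std::panic::catch_unwind(|| {
            parse_document(RcDom::default(), Default::default())
                .from_utf8()
                .read_from(&mut page.as_bytes())
        });
        if result.is_err() {
            eprintln!("panic on case {i}: {page:?}");
        }
    }
}
```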
Autonomy vs human-guided architecture
- A recurring theme: autonomous agents can write lots of code, but produce incoherent, conceptually weak architectures.
- A browser engineer dissects subsystems (e.g., IndexedDB) and argues the design can’t evolve into a real multi-process engine; shared Arc<Mutex<...>> state and odd rendering loops are cited as examples that diverge from web standards (see the sketch after this list).
- Proposed alternative: humans define the architecture and constraints, and agents handle implementation details within modular, human-reviewed loops, more like a traditional open-source project.
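To illustrate the architectural objection: state shared behind one Arc<Mutex<...>> couples every subsystem to a single lock and cannot cross a process boundary, while subsystems that own their state and communicate by message passing can later be split into separate processes. A minimal sketch of the two shapes (the types and messages are hypothetical, not taken from the project):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Shape 1: the criticized pattern. Every subsystem grabs the same lock,
// so work serializes and nothing can move out-of-process.
#[derive(Default)]
struct EngineState {
    dirty_nodes: Vec<u32>,
}

fn shared_state_style(state: Arc<Mutex<EngineState>>) {
    state.lock().unwrap().dirty_nodes.push(7);
}

// Shape 2: the subsystem owns its state and receives messages. The channel
// can later be swapped for an IPC pipe, enabling a multi-process split.
enum RenderMsg {
    InvalidateNode(u32),
    Shutdown,
}

fn message_passing_style() -> mpsc::Sender<RenderMsg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut dirty_nodes: Vec<u32> = Vec::new(); // owned, no lock needed
        for msg in rx {
            match msg {
                RenderMsg::InvalidateNode(id) => dirty_nodes.push(id),
                RenderMsg::Shutdown => break,
            }
        }
    });
    tx
}

fn main() {
    shared_state_style(Arc::new(Mutex::new(EngineState::default())));
    let renderer = message_passing_style();
    renderer.send(RenderMsg::InvalidateNode(7)).unwrap();
    renderer.send(RenderMsg::Shutdown).unwrap();
}
```

The channel boundary in the second shape is exactly where a process boundary can later go, which is what “can evolve into a multi-process engine” means in practice.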
Maintainability and lifecycle concerns
- Several report AI-generated repos full of duplication and brittle quick fixes; maintainability is an open question, especially beyond a few months (a representative pattern is sketched after this list).
- Some speculate that future, better models might “clean up” older slop; others counter that current outputs are essentially throwaway.
- Questions are raised about how “autonomous” the week-long run really was and what human interventions occurred.
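A hypothetical miniature of the duplication pattern being reported: near-identical helpers accrete per call site instead of one parameterized function, so every later bug fix has to be applied N times:

```rust
// The reported pattern: each feature request got its own near-copy.
fn parse_margin_px(v: &str) -> f32 {
    v.trim_end_matches("px").parse().unwrap_or(0.0)
}
fn parse_padding_px(v: &str) -> f32 {
    v.trim_end_matches("px").parse().unwrap_or(0.0)
}

// The maintainable shape: one helper, named by what it does.
fn parse_px(v: &str) -> f32 {
    v.trim_end_matches("px").parse().unwrap_or(0.0)
}

fn main() {
    assert_eq!(parse_margin_px("4px"), 4.0);
    assert_eq!(parse_padding_px("4px"), 4.0);
    assert_eq!(parse_px("4px"), 4.0);
}
```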
Browser as benchmark vs real-world relevance
- One side: browsers are among the most complex software systems; even partial success is a strong capability signal.
- Other side: this is an unusually favorable domain—clear specs, exhaustive tests, reference implementations, decomposable components, and models already trained on many browsers. Most real-world problems lack these properties.
Costs, impact, and philosophy
- Token usage in the trillions (implying a multi-million-dollar spend) divides opinion: some see it as cheaper than a team, others say the resulting code isn’t worth even cents (see the back-of-envelope arithmetic after this list).
- Environmental and system-level cost comparisons (humans vs GPUs, datacenters, food, education) are deemed extremely complex.
- Long subthreads debate whether LLMs are “just remixers/statistical parrots” versus a nascent form of intelligence; there’s no consensus, but several stress that usefulness doesn’t require “true” understanding.
- Many conclude that tests, specs, and project context (docs, embedded standards) are the real long-term assets; raw code is increasingly commodity.
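The “trillions of tokens implies millions of dollars” step is simple arithmetic. A back-of-envelope sketch; both numbers below are illustrative assumptions, not the experiment’s actual figures:

```rust
fn main() {
    // Illustrative assumptions, not the experiment's actual numbers:
    let tokens: f64 = 2.0e12;           // "trillions" of tokens
    let usd_per_million_tokens = 1.50;  // hypothetical blended rate
    let cost = tokens / 1.0e6 * usd_per_million_tokens;
    println!("~${:.1}M", cost / 1.0e6); // ~$3.0M: a multi-million-dollar spend
}
```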