Scaling long-running autonomous coding
Agent architecture & context windows
- Discussion assumes a multi-agent setup: planner/manager agents divide work into modules, with worker agents handling specific tasks and tests.
- Large context windows matter less than expected; agents lean on tools like grep, file-based plans, and local indexing to operate on codebases larger than their immediate context.
- Some report success with “project manager” specs (e.g., agents.md) and hierarchical planner–worker patterns, including 3-layer pipelines (prompt → plan → tasks → workers); a minimal sketch of the pattern follows this list.
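
A minimal sketch of the pattern commenters describe, assuming a Python orchestrator. Every name here (call_model, plan.json, grep_repo) is illustrative rather than taken from the thread, and the model call is stubbed so the skeleton runs on its own:

```python
# Hypothetical 3-layer pipeline: prompt -> plan -> tasks -> workers.
# A file-based plan persists progress outside any one agent's context,
# and a grep-style tool stands in for a large context window.
import json
import pathlib
import re

PLAN_FILE = pathlib.Path("plan.json")  # illustrative file name

def call_model(role: str, prompt: str) -> str:
    """Stand-in for an LLM API call so the sketch runs end to end."""
    if role == "planner":
        return "parse CSS selectors\nimplement block layout"
    return "// worker output elided"

def grep_repo(pattern: str, root: str = ".") -> list[str]:
    """Crude grep: a worker finds relevant code without loading the
    whole repository into its context window."""
    hits = []
    for path in pathlib.Path(root).rglob("*.rs"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if re.search(pattern, line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits

def plan(prompt: str) -> list[dict]:
    """Layer 1: the planner decomposes the prompt into module-level
    tasks and persists them to disk."""
    tasks = [{"id": i, "goal": g, "status": "todo"}
             for i, g in enumerate(call_model("planner", prompt).splitlines())]
    PLAN_FILE.write_text(json.dumps(tasks, indent=2))
    return tasks

def work(task: dict) -> None:
    """Layer 3: a worker gathers context via grep, then asks the model
    for code and tests scoped to a single task."""
    context = "\n".join(grep_repo(task["goal"].split()[0])[:20])
    call_model("worker", f"Task: {task['goal']}\nContext:\n{context}")
    # ...apply the patch, run tests, mark the task done in plan.json...

def run(prompt: str) -> None:
    for task in plan(prompt):  # layer 2: the on-disk task queue
        work(task)

if __name__ == "__main__":
    run("Build a minimal CSS layout module")
```

Persisting the plan to disk is what lets a run outlive any single context window: workers can crash, be restarted, or be swapped for fresh agents without losing state.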
Browser experiment: capabilities vs reality
- Many are impressed that agents can assemble a browser-like engine at all, given the complexity of specs, edge cases, performance, and interoperability.
- Others point out the repository often doesn’t compile, CI is frequently red, and there’s no clear “known-good” commit corresponding to the demo.
- The implementation relies heavily on existing crates (JS engine, windowing, graphics, layout, CSS engines), so “from scratch” is viewed as marketing rather than literal.
- Some suspect it tracks closely with existing Rust browsers and toy engines available online.
Code quality, maintainability, and convergence
- Multiple commenters describe the code as brittle, warning-filled, and hard to navigate: many tiny files, unclear architecture, weak docs.
- Agents appear to ignore compiler warnings, and PRs with failing CI get merged, which commenters read as human-like sloppiness rather than rigor; the kind of gate that would prevent this is sketched after this list.
- Several note that autonomous agents tend to diverge into “monstrosities” rather than converge, unless tightly steered by humans.
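
For concreteness, a minimal pre-merge gate of the kind the thread implies is absent, assuming a Rust workspace with cargo on PATH; `RUSTFLAGS="-D warnings"` and `cargo clippy -- -D warnings` are standard Cargo/Clippy usage, not anything observed in the project:

```python
# Minimal pre-merge gate: promote compiler warnings to errors and require
# green tests before anything merges. Assumes a Rust workspace; the flags
# are standard rustc/clippy usage, not taken from the project under discussion.
import os
import subprocess
import sys

CHECKS = [
    ["cargo", "build", "--workspace"],                           # warnings become errors via RUSTFLAGS
    ["cargo", "clippy", "--workspace", "--", "-D", "warnings"],  # lint at the same strictness
    ["cargo", "test", "--workspace"],                            # a red suite blocks the merge
]

def main() -> int:
    env = {**os.environ, "RUSTFLAGS": "-D warnings"}
    for cmd in CHECKS:
        print("::", " ".join(cmd))
        if subprocess.run(cmd, env=env).returncode != 0:
            return 1  # first failure stops the merge
    return 0

if __name__ == "__main__":
    sys.exit(main())
```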
Usefulness, evaluation, and missing details
- The lack of a merged, production-grade PR or running public demo makes some see this as primarily a marketing/hype piece.
- Calls for more grounded benchmarks: gradually harder projects, long-lived systems with real users and lower bug rates than human-written equivalents, or tasks drawn from repositories created after a model's training cutoff (e.g., SWE-rebench-style).
- Cost is highlighted as a missing metric: billions of tokens are mentioned, but there is no clear accounting of dollars per working feature or test; a back-of-envelope conversion follows this list.
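
To make the missing metric concrete, a back-of-envelope conversion under loudly assumed prices; none of the rates or counts below come from the thread:

```python
# Back-of-envelope cost conversion for the "billions of tokens" claim.
# ASSUMPTIONS (not from the thread): $5 per million input tokens,
# $15 per million output tokens, an 80/20 input/output split,
# and illustrative totals for tokens and shipped features.
TOKENS_TOTAL = 2_000_000_000    # "billions of tokens", illustrative
INPUT_RATE = 5 / 1_000_000      # $/token, assumed
OUTPUT_RATE = 15 / 1_000_000    # $/token, assumed
INPUT_SHARE = 0.8               # assumed input-heavy agent traffic

cost = TOKENS_TOTAL * (INPUT_SHARE * INPUT_RATE + (1 - INPUT_SHARE) * OUTPUT_RATE)
print(f"~${cost:,.0f} total")   # ~$14,000 under these assumptions

FEATURES_SHIPPED = 100          # illustrative denominator
print(f"~${cost / FEATURES_SHIPPED:,.0f} per working feature")
```

Under these assumptions the run costs on the order of $14,000, or about $140 per shipped feature; the point is not the numbers but that the conversion is trivial once token counts and rates are published.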
Broader implications and sentiment
- Optimists see a path to cheap software where cost is mostly tokens + hardware, with humans focusing on product management and specification.
- Skeptics emphasize that understanding user needs, specifying requirements, and reviewing tests/code remain bottlenecks that agents don’t remove.
- Several report strong productivity from human-in-the-loop “vibe coding” for small/medium projects, but persistent failure on complex scientific/edge-case-heavy tasks.
- Overall tone is mixed: awe at what is already possible, and deep distrust of claims of full autonomy and of a literal “from-scratch” build.