Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed
Edit Addressing: Line Numbers, Hashes, and Structure
- Several commenters compare the post’s hash-per-line scheme to simpler “line numbers only” addressing.
- Line numbers are more compact but fragile when files change between read and write or after multiple edits.
- Hashes (or hash-like tags) make edits robust to shifting lines and avoid clobbering mismatched content.
- Some worry about loss of concurrency: search/replace lets multiple edits proceed independently; line- or hash-based schemes can serialize writes and require more reindexing. Others report that in practice serialization is fine and token savings are worth it.
- Alternatives discussed:
  - TOC-style “content_point” references per symbol or function.
  - Tree-sitter / AST tools that list and update nodes by IDs or hashes.
  - Fuzzy matching (e.g., Damerau–Levenshtein) to confirm intended replacements rather than requiring exact matches.
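To make the trade-off between line numbers and hashes concrete, here is a minimal Python sketch of hash-tagged line addressing. It is an illustration under stated assumptions, not the post's actual protocol: the `N:tag| content` display format, the 8-character truncated SHA-256 tag, and the names `line_tag`, `render`, and `apply_edit` are all hypothetical.

```python
import hashlib


def line_tag(text: str, width: int = 8) -> str:
    """Short content hash for one line (tag format is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:width]


def render(lines: list[str]) -> list[str]:
    """Show each line as 'N:tag| content', the view a model would edit against."""
    return [f"{i}:{line_tag(t)}| {t}" for i, t in enumerate(lines, start=1)]


def apply_edit(lines: list[str], line_no: int, tag: str, new_text: str) -> list[str]:
    """Replace line `line_no` only if its current content still matches `tag`.

    A plain line-number scheme would skip the tag check and could overwrite
    the wrong line after the file has shifted.
    """
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} is out of range")
    if line_tag(lines[line_no - 1]) != tag:
        raise ValueError(f"stale edit: line {line_no} no longer matches tag {tag}")
    return lines[: line_no - 1] + [new_text] + lines[line_no:]


if __name__ == "__main__":
    source = ["def add(a, b):", "    return a - b"]
    print("\n".join(render(source)))

    # The edit names both the line number and the tag it saw at read time.
    tag = line_tag(source[1])
    print(apply_edit(source, 2, tag, "    return a + b"))

    # If another edit inserted a line first, the same address is rejected
    # instead of clobbering whatever now sits on line 2.
    shifted = ["# helper"] + source
    try:
        apply_edit(shifted, 2, tag, "    return a + b")
    except ValueError as err:
        print(err)
```

In this sketch a rejected edit simply forces a re-read. That is the serialization cost some commenters raise, paid in exchange for never overwriting mismatched content.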
Harness as Primary Leverage Point
- Strong agreement that the “harness” (tools, context management, edit protocol, feedback loop) often matters more than model choice.
- The same model can jump from “barely usable” to “legitimately helpful” with better context and edit tools.
- Benchmarks such as CORE and TerminalBench, along with browser-agent evaluations, show large swings in scores purely from harness changes.
- Some frame the real “AI system” as LLM + harness + human-in-the-loop, a cybernetic or neurosymbolic whole rather than just the model.
- Many expect future developers to spend more time designing harnesses and workflows than hand-writing code.
Closed Harnesses, Subscriptions, and Lock‑in
- Big debate over proprietary harnesses (e.g., IDE integrations, terminal agents) tied to flat-rate subscriptions.
- One side sees lock-in, telemetry, future “enshittification,” and incentives to waste tokens.
- Others report that subscriptions have only improved so far and consider price hikes relatively insignificant for professionals.
- Several want OAuth-based access: use any harness with a monthly plan instead of being forced into the vendor’s UI.
- Economic angle: subscriptions are subsidized/oversubscribed “loss leaders,” whereas raw API tokens are priced higher.
Bans, Sovereign Models, and Trust
- The author’s loss of access to consumer endpoints (for using them via a custom harness) prompts discussion:
  - Some say using unpublished/subsidized endpoints this way is understandably disallowed.
  - Others see it as arbitrary, similar to platform bans, reinforcing the need for self-hostable “sovereign” models and open harnesses.
- Side debate over large labs’ historic scraping behavior and current claims of respecting robots.txt.
Limitations and Skepticism About the Results
- Several commenters think the technique is promising but oversold.
- The benchmark is narrow (find-and-replace style edits); a 5–14 point boost there may translate to only modest real-world gains.
- Commenters want an analysis that separates pure harness failures from reasoning failures.
- Some existing systems (e.g., Codex) already use constrained grammars for patches, so the comparisons may be incomplete.
Broader Reflections on Coding Agents
- Multiple accounts confirm that modest harness tweaks (better edit tools, repo maps, validation steps) massively improve reliability, especially for security-sensitive changes.
- There’s ongoing confusion about “best” coding harnesses; some users are gravitating toward lightweight, extensible OSS agents and even writing their own.
- Longer-term concerns: dependence on a few vendors that can deplatform users, and wider societal impacts if AI-assisted coding accelerates job displacement.