Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

Edit Addressing: Line Numbers, Hashes, and Structure

  • Several commenters compare the post’s hash-per-line scheme to simpler “line numbers only” addressing.
    • Line numbers are more compact but fragile when files change between read and write or after multiple edits.
    • Hashes (or hash-like tags) make edits robust to shifting lines and avoid clobbering mismatched content.
  • Some worry about loss of concurrency: search/replace lets multiple edits proceed independently; line- or hash-based schemes can serialize writes and require more reindexing. Others report that in practice serialization is fine and token savings are worth it.
  • Alternatives discussed:
    • TOC-style “content_point” references per symbol or function.
    • Tree-sitter / AST tools that list and update nodes by IDs or hashes.
    • Fuzzy matching (e.g., Damerau–Levenshtein) to confirm intended replacements rather than requiring exact matches.
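
A minimal sketch of the hash-tagged addressing idea discussed above, not the post's actual format: each line is shown to the model with a short content tag, and an edit is applied only if the tag it cites still matches the line at that position. The names `line_tag`, `render_for_model`, `apply_edit`, and `StaleEdit`, the `number:tag|content` layout, and the 8-character tag length are all assumptions made for illustration; a real harness would likely use shorter tags to save tokens.

```python
import hashlib

TAG_LEN = 8  # assumed length; shorter tags save tokens but collide more often


class StaleEdit(Exception):
    """Raised when an edit cites a line whose content no longer matches."""


def line_tag(text: str) -> str:
    """Return a short content hash for one line of source."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:TAG_LEN]


def render_for_model(lines: list[str]) -> str:
    """Format lines as 'number:tag|content' so the model can cite the tag."""
    return "\n".join(
        f"{number}:{line_tag(text)}|{text}"
        for number, text in enumerate(lines, start=1)
    )


def apply_edit(lines: list[str], line_no: int, tag: str, new_text: str) -> list[str]:
    """Replace one line, but only if the cited tag still matches its content.

    Returns a new list; the input list is left unchanged.
    """
    if not 1 <= line_no <= len(lines):
        raise StaleEdit(f"line {line_no} is out of range")
    if line_tag(lines[line_no - 1]) != tag:
        # The file shifted or changed since the model read it: refuse
        # rather than clobber content the model never saw.
        raise StaleEdit(f"line {line_no} no longer matches tag {tag}")
    return lines[: line_no - 1] + [new_text] + lines[line_no:]
```

With plain line numbers, an edit aimed at line 2 would silently overwrite the wrong line if something had been inserted above it in the meantime. Here the same edit raises `StaleEdit`, and the harness can re-read the file and ask the model to retry.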

Harness as Primary Leverage Point

  • Strong agreement that the “harness” (tools, context management, edit protocol, feedback loop) often matters more than model choice.
    • The same model can jump from “barely usable” to “legitimately helpful” with better context and edit tools.
    • Results on benchmarks such as CORE and TerminalBench, and on browser-agent tasks, show large swings in scores purely from harness changes.
  • Some frame the real “AI system” as LLM + harness + human-in-the-loop, a cybernetic or neurosymbolic whole rather than just the model.
  • Many expect future developers to spend more time designing harnesses and workflows than hand-writing code.

Closed Harnesses, Subscriptions, and Lock‑in

  • Big debate over proprietary harnesses (e.g., IDE integrations, terminal agents) tied to flat-rate subscriptions.
    • One side sees lock-in, telemetry, future “enshittification,” and incentives to waste tokens.
    • Others report that subscriptions have only improved so far and consider price hikes relatively insignificant for professionals.
  • Several want OAuth-based access: use any harness with a monthly plan instead of being forced into the vendor’s UI.
  • Economic angle: subscriptions are subsidized/oversubscribed “loss leaders,” whereas raw API tokens are priced higher.

Bans, Sovereign Models, and Trust

  • The author’s loss of access to consumer endpoints (for using them via a custom harness) prompts discussion:
    • Some say using unpublished/subsidized endpoints this way is understandably disallowed.
    • Others see it as arbitrary, similar to platform bans, reinforcing the need for self-hostable “sovereign” models and open harnesses.
  • Side debate over large labs’ historic scraping behavior and current claims of respecting robots.txt.

Limitations and Skepticism About the Results

  • Several commenters think the technique is promising but oversold.
    • The benchmark is narrow (find-and-replace style edits); a 5–14 point boost there may translate to only modest real-world gains.
    • Commenters want analysis that separates pure harness failures from reasoning failures.
    • Some note that existing systems (e.g., Codex) already use constrained grammars for patches, so comparisons may be incomplete.

Broader Reflections on Coding Agents

  • Multiple accounts confirm that modest harness tweaks (better edit tools, repo maps, validation steps) massively improve reliability, especially for security-sensitive changes.
  • There is no consensus on which coding harness is “best”; some users are gravitating toward lightweight, extensible OSS agents, and some are writing their own.
  • Longer-term concerns: dependence on a few vendors that can deplatform users, and wider societal impacts if AI-assisted coding accelerates job displacement.