2026-02-12

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

Edit Addressing: Line Numbers, Hashes, and Structure

Several commenters compare the post’s hash-per-line scheme to simpler “line numbers only” addressing.
- Line numbers are more compact but fragile when files change between read and write or after multiple edits.
- Hashes (or hash-like tags) make edits robust to shifting lines and avoid clobbering mismatched content.
Some worry about loss of concurrency: search/replace lets multiple edits proceed independently; line- or hash-based schemes can serialize writes and require more reindexing. Others report that in practice serialization is fine and token savings are worth it.
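The trade-off between plain line numbers and hash-tagged lines can be sketched in a few lines of Python. This is a minimal illustration and not the post's actual implementation: the function names (`line_tag`, `render`, `apply_edit`), the `number:tag|content` display format, and the 8-character SHA-1 prefix are all assumptions made for the example.

```python
import hashlib


def line_tag(text: str, width: int = 8) -> str:
    """Short content hash for one line (the width is an arbitrary choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:width]


def render(lines: list[str]) -> list[str]:
    """Hypothetical view shown to the model: 'number:tag|content' per line."""
    return [f"{i}:{line_tag(t)}|{t}" for i, t in enumerate(lines, start=1)]


def apply_edit(lines: list[str], line_no: int, tag: str, new_text: str) -> list[str]:
    """Replace one line, but only if its current content still matches the tag.

    A bare line-number scheme would skip the tag check and could overwrite
    whatever now sits at that position after earlier edits shifted the file.
    """
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} is out of range")
    if line_tag(lines[line_no - 1]) != tag:
        raise ValueError(f"line {line_no} changed since it was read; re-read the file")
    updated = list(lines)  # leave the caller's list untouched
    updated[line_no - 1] = new_text
    return updated
```

In this sketch a stale tag is rejected instead of applied, which is the robustness the hash advocates describe. The cost is the one the skeptics point to: after each accepted edit the tags must be recomputed before the next edit can be addressed.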
Alternatives discussed:
- TOC-style “content_point” references per symbol or function.
- Tree-sitter / AST tools that list and update nodes by IDs or hashes.
- Fuzzy matching (e.g., Damerau–Levenshtein) to confirm intended replacements rather than requiring exact matches.
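The fuzzy-matching alternative can be sketched as follows. This is an assumption-laden illustration, not any commenter's code: it uses the optimal string alignment distance (the restricted variant of Damerau–Levenshtein, which counts an adjacent transposition as one edit), and the helper name `find_line` and the 20% tolerance are hypothetical choices.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "la" <-> "al"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_line(lines: list[str], wanted: str, max_ratio: float = 0.2):
    """Index of the closest line, or None if no line is close enough
    or the best match is tied (ambiguous)."""
    target = wanted.strip()
    scored = sorted(
        (osa_distance(line.strip(), target), i) for i, line in enumerate(lines)
    )
    if not scored:
        return None
    best_dist, best_idx = scored[0]
    if best_dist > max_ratio * max(len(target), 1):
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # two equally good candidates: refuse to guess
    return best_idx
```

Refusing ambiguous or distant matches keeps the tolerance from turning into a new way to clobber the wrong line; a real harness would likely report the failure back to the model so it can retry with more context.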

Harness as Primary Leverage Point

Commenters strongly agree that the “harness” (tools, context management, edit protocol, feedback loop) often matters more than model choice.
- The same model can jump from “barely usable” to “legitimately helpful” with better context and edit tools.
- Benchmarks like CORE, TerminalBench, and browser agents show large swings in scores purely from harness changes.
Some frame the real “AI system” as LLM + harness + human-in-the-loop, a cybernetic or neurosymbolic whole rather than just the model.
Many expect future developers to spend more time designing harnesses and workflows than hand-writing code.

Closed Harnesses, Subscriptions, and Lock‑in

Big debate over proprietary harnesses (e.g., IDE integrations, terminal agents) tied to flat-rate subscriptions.
- One side sees lock-in, telemetry, future “enshittification,” and incentives to waste tokens.
- Others report that subscriptions have only improved so far and consider price hikes relatively insignificant for professionals.
Several want OAuth-based access: use any harness with a monthly plan instead of being forced into the vendor’s UI.
Economic angle: subscriptions are subsidized/oversubscribed “loss leaders,” whereas raw API tokens are priced higher.

Bans, Sovereign Models, and Trust

The author’s loss of access to consumer endpoints (for using them via a custom harness) prompts discussion:
- Some say using unpublished/subsidized endpoints this way is understandably disallowed.
- Others see it as arbitrary, similar to platform bans, reinforcing the need for self-hostable “sovereign” models and open harnesses.
Side debate over large labs’ historic scraping behavior and current claims of respecting robots.txt.

Limitations and Skepticism About the Results

Several commenters think the technique is promising but oversold.
- The benchmark is narrow (find-and-replace style edits); a 5–14 point boost there may translate to only modest real-world gains.
- Commenters want analysis that separates pure harness failures from reasoning failures.
- Note that some existing systems (e.g., Codex) already use constrained grammars for patches, so comparisons may be incomplete.

Broader Reflections on Coding Agents

Multiple accounts confirm that modest harness tweaks (better edit tools, repo maps, validation steps) massively improve reliability, especially for security-sensitive changes.
There’s ongoing confusion about “best” coding harnesses; some users are gravitating toward lightweight, extensible OSS agents and even writing their own.
Longer-term concerns: dependence on a few vendors that can deplatform users, and wider societal impacts if AI-assisted coding accelerates job displacement.
