Show HN: OSS agent I built topped TerminalBench on Gemini-3-flash-preview

Overview of Dirac Agent / Harness

  • Dirac is a heavily modified fork of the Cline harness, with both a CLI (dirac-cli) and a VS Code extension.
  • It topped TerminalBench 2.0 using gemini-3-flash-preview and supports many providers/models (OpenAI, Qwen, open weights via OpenRouter or custom OpenAI-compatible endpoints).
  • Plan-and-act style workflows and subagents from Cline are preserved and extended.

Key Techniques and Design Choices

  • Uses a “hash-anchored edits” mechanism for file modifications: lines are tagged with short hash tokens the model can reference (initially single tokens, later two-token combinations), and edits are applied by matching those anchors via a diff-based mechanism.
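
The post doesn't spell out Dirac's exact scheme, but one way line-hash anchoring could work is sketched below; the names `anchor`, `annotate`, and `apply_edit` are illustrative, not Dirac's API:

```python
import hashlib

def anchor(line: str, n: int = 4) -> str:
    # Short content hash used as a stable token for this line.
    return hashlib.sha1(line.encode()).hexdigest()[:n]

def annotate(source: str) -> list[tuple[str, str]]:
    # Pair every line with its anchor so the model can cite "a1b2"
    # instead of a line number that shifts as the file changes.
    return [(anchor(line), line) for line in source.splitlines()]

def apply_edit(source: str, target_anchor: str, new_text: str) -> str:
    # Replace the line whose anchor matches; unlike line numbers,
    # anchors survive edits elsewhere in the file.
    out = []
    for a, line in annotate(source):
        out.append(new_text if a == target_anchor else line)
    return "\n".join(out)
```

A real implementation would also have to handle duplicate lines (hence the move to two-token combinations) and stale anchors after concurrent edits.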
  • Employs Tree-sitter-based AST parsing for ~14 languages to:
    • Select relevant code regions instead of loading whole large files.
    • Drive symbol-aware search/refactor operations.
  • Batches many file reads/edits into single tool calls to overcome models’ reluctance to issue parallel tool calls.
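
The batching idea amounts to exposing one tool that accepts many targets, rather than hoping the model emits parallel calls; a minimal sketch (the `read_files` tool name is an assumption):

```python
from pathlib import Path

def read_files(paths: list[str]) -> dict[str, str]:
    # One tool call returning many files at once, sidestepping models'
    # reluctance to issue several tool calls in a single turn.
    results = {}
    for p in paths:
        try:
            results[p] = Path(p).read_text()
        except OSError as e:
            # Report per-file errors in-band so one bad path
            # doesn't fail the whole batch.
            results[p] = f"<error: {e}>"
    return results
```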
  • Lets models execute code (bash/python/etc.) as tools to analyze or transform code.
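
A code-execution tool of this kind is typically a thin subprocess wrapper that returns captured output for the model to reason over; a hedged sketch (the `run_code` shape is illustrative, and a production harness would add sandboxing):

```python
import subprocess

def run_code(command: list[str], timeout: float = 30.0) -> dict:
    # Execute a short program (bash, python, ...) as a tool call and hand
    # stdout/stderr/exit code back to the model.
    proc = subprocess.run(command, capture_output=True, text=True,
                          timeout=timeout)
    return {"stdout": proc.stdout,
            "stderr": proc.stderr,
            "exit_code": proc.returncode}
```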
  • Maintains a local SQLite “symbols DB” updated incrementally for faster semantic queries.
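
The incremental part is the interesting bit: on each save, only the changed file's rows are replaced. A minimal sketch of such a symbols DB, assuming a flat `(name, kind, file, line)` schema (the actual schema is not documented):

```python
import sqlite3

def open_symbols_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS symbols ("
        "  name TEXT, kind TEXT, file TEXT, line INTEGER,"
        "  PRIMARY KEY (file, name))"
    )
    return conn

def reindex_file(conn, file: str,
                 symbols: list[tuple[str, str, int]]) -> None:
    # Incremental update: drop this file's rows, insert fresh ones,
    # leaving the rest of the index untouched.
    with conn:
        conn.execute("DELETE FROM symbols WHERE file = ?", (file,))
        conn.executemany(
            "INSERT INTO symbols VALUES (?, ?, ?, ?)",
            [(name, kind, file, line) for name, kind, line in symbols],
        )

def lookup(conn, name: str) -> list[tuple[str, int]]:
    return conn.execute(
        "SELECT file, line FROM symbols WHERE name = ?", (name,)
    ).fetchall()
```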

Performance, Benchmarks, and Harness vs Model

  • Multiple comments highlight that harness design can matter more than which frontier model is used; swapping harnesses often changes benchmark scores more than swapping models.
  • Dirac’s own small eval suite compares it to other agents (including pi and OpenCode); tasks needing symbol-aware edits show clearer gains from AST usage.
  • There is interest in benchmarking with non-Gemini models and measuring time-to-completion and token usage, but OSS models often hit TerminalBench timeouts due to slow inference.

Limitations, Concerns, and Open Questions

  • AST features only work for languages with available parsers; without them, Dirac falls back to simpler behavior.
  • Some users question whether hash anchors are actually more token-efficient than smart search/replace, suggesting file skeleton display may be the bigger win.
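
The "file skeleton" idea those users point to is easy to illustrate: show the model only top-level signatures, deferring bodies until a symbol is actually selected. A sketch using Python's `ast` module (the `skeleton` helper is hypothetical):

```python
import ast

def skeleton(source: str) -> str:
    # Emit only top-level signatures so the model sees the file's shape
    # without paying tokens for every body.
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
    return "\n".join(lines)
```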
  • Telemetry and feature-flag calls are on by default, and web tools previously proxied requests through the project’s servers; this raised privacy concerns and led to the web tools being removed and the defaults being clarified by the author.
  • Context management strategies (pruning vs relying on provider caching) and subagent delegation remain active areas of experimentation, with mixed experiences across models.
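
To make the pruning-vs-caching trade-off concrete, here is a hedged sketch of one pruning policy (keep the system prompt plus the most recent turns that fit a budget); the alternative is to keep the transcript intact so a stable prefix stays eligible for provider-side prompt caching:

```python
def prune_context(messages: list[dict], max_chars: int) -> list[dict]:
    # Keep the system prompt, then admit turns newest-first until the
    # character budget is exhausted. Note that pruning mid-conversation
    # invalidates any cached prompt prefix on the provider side.
    system, rest = messages[:1], messages[1:]
    kept, total = [], 0
    for msg in reversed(rest):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```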