Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Overview of Dirac Agent / Harness
- Dirac is a heavily modified fork of the Cline harness, with both a CLI (dirac-cli) and a VS Code extension.
- It topped TerminalBench 2.0 using gemini-3-flash-preview and supports many providers/models (OpenAI, Qwen, open weights via OpenRouter or custom OpenAI-compatible endpoints).
- Plan-and-act style workflows and subagents from Cline are preserved and extended.
Key Techniques and Design Choices
- Uses an optimized “hash-anchored edits” approach for file modifications; anchors are single tokens (later two-token combos) mapped via a diff-based mechanism.
- Employs Tree-sitter-based AST parsing for ~14 languages to:
  - Select relevant code regions instead of loading whole large files.
  - Drive symbol-aware search/refactor operations.
- Batches many file reads/edits into single tool calls to overcome models’ reluctance to issue parallel tool calls.
- Lets models execute code (bash/python/etc.) as tools to analyze or transform code.
- Maintains a local SQLite “symbols DB” updated incrementally for faster semantic queries.
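
To make the "hash-anchored edits" idea concrete, here is a minimal sketch of how such a scheme could work. It is not Dirac's implementation: the thread says Dirac's anchors are single tokens (later two-token combos), while this sketch assumes short hex prefixes of a content hash. The names anchor_for, annotate, and apply_edit are hypothetical. The point is only that the model can name a line range by its anchors instead of reproducing the old text.

```python
import hashlib


def anchor_for(line: str, width: int = 4) -> str:
    """Short content hash used as a line anchor (assumed scheme, not Dirac's)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:width]


def annotate(lines):
    """Pair each line with an anchor; repeated hashes get a numeric suffix."""
    seen = {}
    view = []
    for line in lines:
        base = anchor_for(line)
        count = seen.get(base, 0)
        seen[base] = count + 1
        anchor = base if count == 0 else f"{base}.{count}"
        view.append((anchor, line))
    return view


def apply_edit(lines, start_anchor, end_anchor, replacement):
    """Replace the inclusive range between two anchors with new lines.

    Raises ValueError when an anchor is unknown, which is how a stale
    view of the file would be detected in this sketch.
    """
    anchors = [anchor for anchor, _ in annotate(lines)]
    if start_anchor not in anchors or end_anchor not in anchors:
        raise ValueError("unknown or stale anchor")
    start = anchors.index(start_anchor)
    end = anchors.index(end_anchor)
    if end < start:
        raise ValueError("end anchor precedes start anchor")
    return lines[:start] + list(replacement) + lines[end + 1:]


source = ["def add(a, b):", "    return a - b", "", "print(add(1, 2))"]
view = annotate(source)          # what the model would be shown
target = view[1][0]              # anchor of the buggy line
fixed = apply_edit(source, target, target, ["    return a + b"])
```

In this sketch the edit request carries two short anchors plus the replacement lines, so the cost of naming the old text stays constant regardless of how long the replaced region is. Whether that actually beats a well-tuned search/replace in practice is the question raised later in the thread.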
Performance, Benchmarks, and Harness vs Model
- Multiple comments highlight that harness design can matter more than which frontier model is used; swapping harnesses often changes benchmark scores more than swapping models.
- Dirac’s own small eval suite compares it to other agents (including pi and OpenCode); tasks needing symbol-aware edits show clearer gains from AST usage.
- There is interest in benchmarking with non-Gemini models and measuring time-to-completion and token usage, but OSS models often hit TerminalBench timeouts due to slow inference.
Limitations, Concerns, and Open Questions
- AST features only work for languages with available parsers; without them, Dirac falls back to simpler behavior.
- Some users question whether hash anchors are actually more token-efficient than smart search/replace, suggesting file skeleton display may be the bigger win.
- Telemetry and feature-flag calls are on by default, and the web tools previously proxied requests through the project’s servers; this raised privacy concerns, which led to the web tools being removed and to clarifications from the author.
- Context management strategies (pruning vs relying on provider caching) and subagent delegation remain active areas of experimentation, with mixed experiences across models.