WeSearch

Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview


Scored 65.2% vs. Google's official 47.8% and the existing top closed-source agent, Junie CLI, at 64.3%. Since there have been many reports of deliberate cheating on TerminalBench 2.0 lately, I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.
2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).
3. The full terminal bench run was done using the fully open source…

Original article
GitHub

Dirac - Accurate & Highly Token Efficient Open Source AI Agent

Dirac topped the Terminal-Bench-2 leaderboard for gemini-3-flash-preview with a 65.2% score!

It is a well-studied phenomenon that any given model's reasoning ability degrades as context length grows. If we can keep context tightly curated, we improve both accuracy and cost while making larger changes tractable in a single task.

Dirac is an open-source coding agent built with this in mind. It reduces API costs by 64.8% on average while producing better and faster work, using hash-anchored parallel edits, AST manipulation, and a suite of advanced optimizations. Oh, and no MCP.

Our goal: Optimize for bang-for-the-buck on tooling with bare minimum prompting instead of going blindly minimalistic.

📊 Evals

Dirac is benchmarked against other leading open-source agents on complex, real-world refactoring tasks. Dirac consistently achieves 100% accuracy at a fraction of the cost. These evals are run on public GitHub repos and should be reproducible by anyone.

🏆 TerminalBench 2.0 Leaderboard: Dirac recently topped the Terminal-Bench-2 leaderboard with a 65.2% score using gemini-3-flash-preview. This outperforms both Google's official baseline (47.6%) and the top closed-source agent Junie CLI (64.3%). This was achieved without any benchmark-specific info or any AGENTS.md files being inserted.

Note on the cost table below: A bug was discovered in Cline, the parent repo, after running these evals (issue #10314). We have submitted PR #10315 to fix it. The bug caused the evals for Dirac and Cline to slightly underreport costs ($0.03 vs. $0.05 per million cache-read tokens). Although the difference won't be large, we will update the evals soon.
All tasks for all models used gemini-3-flash-preview with thinking set to high.

| Task (Repo) | Files* | Cline | Kilo | Ohmypi | Opencode | Pimono | Roo | Dirac |
|---|---|---|---|---|---|---|---|---|
| Task1 (transformers) | 8 | 🟢 (diff) [$0.37] | 🔴 (diff) [N/A] | 🟡 (diff) [$0.24] | 🟢 (diff) [$0.20] | 🟢 (diff) [$0.34] | 🟢 (diff) [$0.49] | 🟢 (diff) [$0.13] |
| Task2 (vscode) | 21 | 🟢 (diff) [$0.67] | 🟡 (diff) [$0.78] | 🟢 (diff) [$0.63] | 🟢 (diff) [$0.40] | 🟢 (diff) [$0.48] | 🟡 (diff) [$0.58] | 🟢 (diff) [$0.23] |
| Task3 (vscode) | 12 | 🟡 (diff) [$0.42] | 🟢 (diff) [$0.70] | 🟢 (diff) [$0.64] | 🟢 (diff) [$0.32] | 🟢 (diff) [$0.25] | 🟡 (diff) [$0.45] | 🟢 (diff) [$0.16] |
| Task4 (django) | 14 | 🟢 (diff) [$0.36] | 🟢 (diff) [$0.42] | 🟡 (diff) [$0.32] | 🟢 (diff) [$0.24] | 🟡 (diff) [$0.24] | 🟢 (diff) [$0.17] | 🟢 (diff) [$0.08] |
| Task5 (vscode) | 3 | 🔴 (diff) [N/A] | 🟢 (diff) [$0.71] | 🟢 (diff) [$0.43] | 🟢 (diff) [$0.53] | 🟢 (diff) [$0.50] | 🟢 (diff) [$0.36] | 🟢 (diff) [$0.17] |
| Task6 (transformers) | 25 | 🟢 (diff) [$0.87] | 🟡 (diff) [$1.51] | 🟢 (diff) [$0.94] | 🟢 (diff) [$0.90] | 🟢 (diff) [$0.52] | 🟢 (diff) [$1.44] | 🟢 (diff) [$0.34] |
| Task7 (vscode) | 13 | 🟡 (diff) [$0.51] | 🟢 (diff) [$0.77] | 🟢 (diff) [$0.74] | 🟢 (diff) [$0.67] | 🟡 (diff) [$0.45] | 🟢 (diff) [$1.05] | 🟢 (diff) [$0.25] |
| Task8 (transformers) | 3 | 🟢 (diff) [$0.25] | 🟢 (diff) [$0.19] | 🟢 (diff) [$0.17] | 🟢 (diff) [$0.26] | 🟢 (diff) [$0.23] | 🟢 (diff) [$0.29] | 🟢 (diff) [$0.12] |
| Total Correct | | 5/8 | 5/8 | 6/8 | 8/8 | 6/8 | 6/8 | 8/8 |
| Avg Cost | | $0.49 | $0.73 | $0.51 | $0.44 | $0.38 | $0.60 | $0.18 |

🟢 Success | 🟡 Incomplete | 🔴 Failure

Cost Comparison: Dirac is 64.8% cheaper than the competition (a 2.8x cost reduction).

* Expected number of files to be modified/created to complete the task.

See evals/README.md for detailed task descriptions and methodology.

🚀 Key Features

Hash-Anchored Edits: Dirac uses stable line hashes to target edits with extreme…
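The 64.8% and 2.8x figures in the cost comparison can be checked against the per-task costs in the eval table. The sketch below is one way to do that, not the README's stated method: it assumes N/A runs are left out of each agent's average and that the six competitor averages are then averaged with equal weight. The names `costs` and `average_cost` are made up for this example.

```python
# Per-task costs in USD, copied from the eval table (Task1..Task8).
# None marks an N/A cell (a failed run with no reported cost).
costs = {
    "Cline":    [0.37, 0.67, 0.42, 0.36, None, 0.87, 0.51, 0.25],
    "Kilo":     [None, 0.78, 0.70, 0.42, 0.71, 1.51, 0.77, 0.19],
    "Ohmypi":   [0.24, 0.63, 0.64, 0.32, 0.43, 0.94, 0.74, 0.17],
    "Opencode": [0.20, 0.40, 0.32, 0.24, 0.53, 0.90, 0.67, 0.26],
    "Pimono":   [0.34, 0.48, 0.25, 0.24, 0.50, 0.52, 0.45, 0.23],
    "Roo":      [0.49, 0.58, 0.45, 0.17, 0.36, 1.44, 1.05, 0.29],
    "Dirac":    [0.13, 0.23, 0.16, 0.08, 0.17, 0.34, 0.25, 0.12],
}


def average_cost(values):
    """Mean cost over the runs that reported a price (assumption: skip N/A)."""
    priced = [v for v in values if v is not None]
    return sum(priced) / len(priced)


averages = {agent: average_cost(v) for agent, v in costs.items()}

# Assumption: "the competition" is the unweighted mean of the other six averages.
competitors = [avg for agent, avg in averages.items() if agent != "Dirac"]
competitor_mean = sum(competitors) / len(competitors)

reduction = 1 - averages["Dirac"] / competitor_mean  # fraction saved
factor = competitor_mean / averages["Dirac"]         # cost multiple

print(f"{reduction:.1%} cheaper, {factor:.1f}x")  # 64.8% cheaper, 2.8x
```

Under these assumptions the per-agent averages also match the table's Avg Cost row to the cent, which suggests failed runs were excluded there as well.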

This excerpt is published under fair use for community discussion. Read the full article at GitHub.
