Benchmarking Local LLM/Harness Combinations
28 Apr 2026 • on llama-cpp, agents, coding-agents, quantisation, local-llms

[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses

I’ve been running a small benchmark, harness-bench, that pairs local LLMs (served via llama.cpp’s llama-server) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, JAX, C, C++, Rust, and SQL. Each (model, harness, task) cell is sandboxed: the agent only sees a scratch workspace/, and grading is done by a hidden test.sh that the agent never sees. The current sweep is 17 model-quants × 5 harnesses × 16 tasks = 1360 runs on a single M3 Max / 128 GB laptop.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at neuralnoise.com.
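The excerpt compresses the mechanics quite a bit, so here is a minimal Python sketch of how a sweep like the one described could be wired up. Everything below is an assumption for illustration: the directory layout, the harness CLI flags, and the run_cell name are hypothetical stand-ins, not harness-bench's actual code.

```python
"""Minimal sketch of the sweep described in the excerpt.

Assumptions (not from the article): the tasks/ layout, the harness CLI
flags, and run_cell are hypothetical illustrations of the described
setup, not harness-bench's real API.
"""
import itertools
import shutil
import subprocess
import tempfile
from pathlib import Path

MODELS = [f"model-quant-{i}" for i in range(17)]        # 17 model-quants (placeholder names)
HARNESSES = ["aider", "claude-code", "opencode", "pi", "qwen-cli"]  # 5 harnesses
TASKS = [f"task-{i:02d}" for i in range(16)]            # 16 SWE tasks (placeholder names)


def run_cell(model: str, harness: str, task: str) -> bool:
    """Run one (model, harness, task) cell in an isolated scratch directory.

    The agent only ever sees workspace/; grading happens afterwards via a
    hidden test.sh the agent never sees.
    """
    with tempfile.TemporaryDirectory() as sandbox:
        workspace = Path(sandbox) / "workspace"
        shutil.copytree(Path("tasks") / task / "workspace", workspace)

        # Hypothetical harness invocation: point it at the local
        # llama-server-backed model and let it edit workspace/ only.
        try:
            subprocess.run(
                [harness, "--model", model, "--workdir", str(workspace)],
                cwd=workspace,
                timeout=1800,
                check=False,
            )
        except subprocess.TimeoutExpired:
            pass  # a run that times out is graded as-is

        # Hidden grader: resolved outside the sandbox and run only after
        # the agent finishes, so the agent can never read or game it.
        grader = (Path("tasks") / task / "test.sh").resolve()
        result = subprocess.run(["bash", str(grader)], cwd=workspace)
        return result.returncode == 0


if __name__ == "__main__":
    # 17 x 5 x 16 = 1360 cells, run sequentially on one machine.
    for model, harness, task in itertools.product(MODELS, HARNESSES, TASKS):
        passed = run_cell(model, harness, task)
        print(f"{model} / {harness} / {task}: {'PASS' if passed else 'FAIL'}")
```

The key design point the excerpt describes survives even in this toy version: because test.sh lives outside the scratch directory and is only executed after the agent exits, pass/fail is decided by tests the agent could not have inspected or overwritten.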