Benchmarking Local LLM/Harness Combinations
28 Apr 2026 • on llama-cpp, agents, coding-agents, quantisation, local-llms

[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses

I’ve been running a small benchmark, harness-bench, that pairs local LLMs (served via llama.cpp’s llama-server) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, JAX, C, C++, Rust, and SQL. Each (model, harness, task) cell is sandboxed: the agent only sees a scratch workspace/, and grading is done by a hidden test.sh that the agent never sees. The current sweep is 17 model-quants × 5 harnesses × 16 tasks = 1360 runs on a single M3 Max / 128 GB laptop.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at neuralnoise.com.
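The excerpt compresses the mechanics quite a bit, so here is a minimal Python sketch of how a sweep like the one described could be wired up. Everything below is an assumption for illustration: the directory layout, the harness CLI flags, and the run_cell name are hypothetical stand-ins, not harness-bench's actual code.

```python
"""Minimal sketch of the sweep described in the excerpt.

Assumptions (not from the article): the tasks/ layout, the harness CLI
flags, and run_cell are hypothetical illustrations of the described
setup, not harness-bench's real API.
"""
import itertools
import shutil
import subprocess
import tempfile
from pathlib import Path

MODELS = [f"model-quant-{i}" for i in range(17)]        # 17 model-quants (placeholder names)
HARNESSES = ["aider", "claude-code", "opencode", "pi", "qwen-cli"]  # 5 harnesses
TASKS = [f"task-{i:02d}" for i in range(16)]            # 16 SWE tasks (placeholder names)


def run_cell(model: str, harness: str, task: str) -> bool:
    """Run one (model, harness, task) cell in an isolated scratch directory.

    The agent only ever sees workspace/; grading happens afterwards via a
    hidden test.sh the agent never sees.
    """
    with tempfile.TemporaryDirectory() as sandbox:
        workspace = Path(sandbox) / "workspace"
        shutil.copytree(Path("tasks") / task / "workspace", workspace)

        # Hypothetical harness invocation: point it at the local
        # llama-server-backed model and let it edit workspace/ only.
        try:
            subprocess.run(
                [harness, "--model", model, "--workdir", str(workspace)],
                cwd=workspace,
                timeout=1800,
                check=False,
            )
        except subprocess.TimeoutExpired:
            pass  # a run that times out is graded as-is

        # Hidden grader: resolved outside the sandbox and run only after
        # the agent finishes, so the agent can never read or game it.
        grader = (Path("tasks") / task / "test.sh").resolve()
        result = subprocess.run(["bash", str(grader)], cwd=workspace)
        return result.returncode == 0


if __name__ == "__main__":
    # 17 x 5 x 16 = 1360 cells, run sequentially on one machine.
    for model, harness, task in itertools.product(MODELS, HARNESSES, TASKS):
        passed = run_cell(model, harness, task)
        print(f"{model} / {harness} / {task}: {'PASS' if passed else 'FAIL'}")
```

The key design point the excerpt describes survives even in this toy version: because test.sh lives outside the scratch directory and is only executed after the agent exits, pass/fail is decided by tests the agent could not have inspected or overwritten.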