Stop Comparing LLM Agents Without Disclosing the Harness
This position paper argues that, for long-horizon tasks compared across models of similar frontier capability, the execution harness is often a stronger determinant of agent performance than the model it wraps. It introduces the Binding Constraint Thesis, which holds that harness configuration can produce significant performance variance. The authors propose an evaluation framework that requires transparent harness specifications so that comparisons between agents are not misleading.
- The agent execution harness is a crucial factor in determining agent performance.
- Small changes in harness configuration can lead to larger performance shifts than changing the model.
- Current evaluation protocols may misattribute performance gains to model improvements rather than harness configurations.
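To make the disclosure point concrete, here is a minimal sketch of a harness record that could be published next to a benchmark score. It is an illustration under stated assumptions, not the paper's proposed framework: the class names, field names, and the `comparable` helper are all hypothetical. The four fields simply mirror the harness layers the paper names (context construction, tool interaction, orchestration, and verification).

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class HarnessSpec:
    # Hypothetical fields, one per harness layer named in the paper.
    context_construction: str  # e.g. how history is truncated or summarized
    tool_interaction: str      # e.g. tool schema and retry policy
    orchestration: str         # e.g. single loop vs. planner/executor
    verification: str          # e.g. whether outputs are checked before submission

    def fingerprint(self) -> str:
        """Return a stable 12-character hash of the spec to print beside a score."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class AgentResult:
    model: str
    harness: HarnessSpec
    score: float


def comparable(a: AgentResult, b: AgentResult) -> bool:
    """A score difference isolates the model only if both harnesses are identical."""
    return a.harness == b.harness


if __name__ == "__main__":
    # Placeholder values for illustration; these are not results from the paper.
    base = HarnessSpec("sliding window", "JSON tools, 2 retries", "single loop", "none")
    checked = HarnessSpec("sliding window", "JSON tools, 2 retries", "single loop", "unit tests")
    run_a = AgentResult("model-a", base, 0.0)
    run_b = AgentResult("model-b", checked, 0.0)
    print(base.fingerprint(), checked.fingerprint())
    print("model-only comparison:", comparable(run_a, run_b))
```

Hashing the whole spec means any change to one layer yields a different fingerprint, so a reader can see at a glance whether two reported scores share a harness before attributing a gap to the model.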
Opening excerpt (first ~120 words)
arXiv:2605.23950 (cs) [Submitted on 7 May 2026]
Authors: Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy
Abstract: This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.