EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
EvoCode-Bench is a new benchmark for evaluating coding agents in multi-turn, iterative interactions. It consists of 26 stateful coding tasks and assesses whether agents can keep their own codebase working as requirements evolve. The study finds that some agents perform well in single-round evaluations yet struggle significantly in multi-turn scenarios, which highlights the difficulty of tracking changing specifications and avoiding regressions.
- EvoCode-Bench evaluates coding agents through 26 stateful tasks and 227 rounds.
- The benchmark assesses agents' performance over multiple rounds, tracking their ability to adapt to changing requirements.
- Results indicate that most agents achieve only about 50% success in multi-turn metrics, with performance declining over rounds.
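This summary does not describe the paper's actual harness or metrics, so the following is only a minimal Python sketch of what a stateful, multi-round evaluation with regression tracking could look like. Every name here (`evaluate_rounds`, `RoundResult`, `toy_agent`, the dict-based codebase) is hypothetical, and the sketch assumes that checks from earlier rounds stay valid in later rounds, which may not hold when a requirement is deliberately changed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a codebase is a dict of named functions, and each round
# supplies a requirement plus named checks. Not the benchmark's real format.
Codebase = Dict[str, Callable]
Check = Callable[[Codebase], bool]
Round = Tuple[str, Dict[str, Check]]


@dataclass
class RoundResult:
    round_index: int
    new_passed: bool        # all checks introduced in this round pass
    regressions: List[str]  # earlier checks that passed before and fail now


def passes(check: Check, codebase: Codebase) -> bool:
    """Treat any exception (for example a missing function) as a failure."""
    try:
        return bool(check(codebase))
    except Exception:
        return False


def evaluate_rounds(agent_step, rounds: List[Round]) -> List[RoundResult]:
    """Carry one codebase across all rounds and re-run earlier checks each time."""
    codebase: Codebase = {}
    accumulated: Dict[str, Check] = {}
    previously_passing = set()
    results = []
    for index, (requirement, checks) in enumerate(rounds, start=1):
        # The agent edits its own earlier output; state is never reset.
        codebase = agent_step(codebase, requirement)
        new_passed = all(passes(c, codebase) for c in checks.values())
        regressions = sorted(
            name for name in previously_passing
            if not passes(accumulated[name], codebase)
        )
        accumulated.update(checks)
        previously_passing = {
            name for name, c in accumulated.items() if passes(c, codebase)
        }
        results.append(RoundResult(index, new_passed, regressions))
    return results


def toy_agent(codebase: Codebase, requirement: str) -> Codebase:
    """Stand-in agent that satisfies round 2 but breaks round-1 behaviour."""
    updated = dict(codebase)
    if requirement == "add":
        updated["add"] = lambda a, b: a + b
    elif requirement == "sub":
        updated["sub"] = lambda a, b: a - b
        updated["add"] = lambda a, b: a - b  # deliberate bug for illustration
    return updated


ROUNDS: List[Round] = [
    ("add", {"add_works": lambda cb: cb["add"](2, 3) == 5}),
    ("sub", {"sub_works": lambda cb: cb["sub"](5, 3) == 2}),
]

results = evaluate_rounds(toy_agent, ROUNDS)
for r in results:
    print(r)
```

In this toy run the second round passes its own new check while breaking `add_works` from the first round. A single final assessment that only looked at the latest requirement would miss that regression, which is the kind of gap a multi-turn evaluation is meant to expose.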
Opening excerpt (first ~120 words)
arXiv:2605.24110 (cs), Computer Science > Artificial Intelligence
Submitted on 22 May 2026
Authors: Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li
Abstract: Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds.
…
The full article is at arXiv cs.AI.