
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Tags: artificial intelligence, coding agents, benchmarking
⚡ TL;DR · AI summary

EvoCode-Bench is a benchmark for evaluating coding agents in multi-turn, iterative interactions. It comprises 26 stateful coding tasks spanning 227 evaluated rounds and tests whether an agent can keep its own codebase working as requirements evolve. The study finds that some agents perform well in single-round evaluations yet struggle significantly in multi-turn scenarios, where they have trouble tracking the changing specification and introduce regressions.
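
To make the difference between single-round and multi-round evaluation concrete, here is a minimal Python sketch of a round-by-round harness that re-runs earlier checks after each new requirement. It is an illustration only, not the paper's harness: the names `evaluate_rounds`, `RoundResult`, `toy_agent`, and the dict-based "codebase" are hypothetical, and the sketch assumes requirements are purely additive, so every earlier check is expected to keep passing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in: the "codebase" is a dict of named functions.
Codebase = Dict[str, Callable[[int], int]]
Check = Callable[[Codebase], bool]


@dataclass
class RoundResult:
    round_index: int
    new_check_passed: bool  # did the agent satisfy this round's requirement?
    regressions: List[int] = field(default_factory=list)  # earlier rounds now failing


def evaluate_rounds(agent_step, checks_per_round: List[Check]) -> List[RoundResult]:
    """Run the agent round by round on one persistent codebase.

    After every round, the checks from all earlier rounds are re-run, so a
    feature that silently breaks is recorded as a regression.
    """
    codebase: Codebase = {}
    results = []
    for i, check in enumerate(checks_per_round):
        agent_step(i, codebase)  # the agent mutates the shared state in place
        regressions = [j for j in range(i) if not checks_per_round[j](codebase)]
        results.append(RoundResult(i, check(codebase), regressions))
    return results


def toy_agent(round_index: int, codebase: Codebase) -> None:
    """Toy agent (assumption): adds a feature in round 1 but breaks round 0's."""
    if round_index == 0:
        codebase["double"] = lambda x: 2 * x
    elif round_index == 1:
        codebase["square"] = lambda x: x * x
        codebase["double"] = lambda x: x + 2  # accidental regression


checks = [
    lambda cb: "double" in cb and cb["double"](3) == 6,
    lambda cb: "square" in cb and cb["square"](3) == 9,
]

results = evaluate_rounds(toy_agent, checks)
for r in results:
    print(r)
```

In this toy run the agent satisfies the newest requirement in both rounds, so a check on the latest requirement alone would report success, while the round-by-round view records that round 0's behavior broke in round 1.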

Original article
arXiv cs.AI
Opening excerpt (first ~120 words)

arXiv:2605.24110 (cs) [Submitted on 22 May 2026]

Title: EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Authors: Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li

Abstract: Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
