WeSearch

ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

TL;DR (AI summary)

The paper introduces ProcBench, a new benchmark for evaluating process-level defects in LLM coding agents. Existing benchmarks focus solely on final outcomes. ProcBench instead assesses the execution process itself and organizes the defects it finds into a reusable ontology. The evaluation reveals significant differences in execution quality that traditional outcome-based metrics can overlook.
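The outcome-versus-process distinction can be made concrete with a minimal Python sketch. This is not ProcBench's method. The `Step` and `Trajectory` schema, the two defect checks, and their names are hypothetical stand-ins and are not categories from the paper's ontology. Both example trajectories pass the tests, so an outcome-only score cannot tell them apart. A simple per-step defect scan can.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in an agent trajectory (hypothetical schema)."""
    tool: str      # e.g. "edit_file" or "shell"
    argument: str  # file path or shell command


@dataclass
class Trajectory:
    steps: list[Step]
    tests_passed: bool  # the only signal an outcome-only benchmark sees


# Hypothetical defect checks; NOT ProcBench's actual defect ontology.
def edits_test_file(step: Step) -> bool:
    # Crude heuristic: any edit to a path containing "test".
    return step.tool == "edit_file" and "test" in step.argument


def runs_destructive_command(step: Step) -> bool:
    # Crude heuristic: shell commands starting with "rm -rf".
    return step.tool == "shell" and step.argument.startswith("rm -rf")


DEFECT_CHECKS = {
    "edited_tests": edits_test_file,
    "destructive_shell": runs_destructive_command,
}


def outcome_score(traj: Trajectory) -> float:
    """Outcome-only metric: 1.0 if the tests pass, else 0.0."""
    return 1.0 if traj.tests_passed else 0.0


def process_defects(traj: Trajectory) -> list[str]:
    """Names of the defect checks triggered by at least one step."""
    return [name for name, check in DEFECT_CHECKS.items()
            if any(check(step) for step in traj.steps)]


clean = Trajectory([Step("edit_file", "src/parser.py")], tests_passed=True)
sloppy = Trajectory(
    [Step("edit_file", "src/parser.py"),
     Step("edit_file", "tests/test_parser.py"),
     Step("shell", "rm -rf build/")],
    tests_passed=True,
)

print(outcome_score(clean), outcome_score(sloppy))      # 1.0 1.0
print(process_defects(clean), process_defects(sloppy))  # [] ['edited_tests', 'destructive_shell']
```

The substring and prefix checks are deliberately crude. They only show that two runs with identical outcome scores can differ at the process level. A real process-level evaluator would need far richer signals than this illustration assumes.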

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

arXiv:2605.20251 (cs) [Submitted on 18 May 2026 (v1), last revised 21 May 2026 (this version, v2)]
Title: ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
Authors: Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun
Abstract: Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
