Open-World Evaluations for Measuring Frontier AI Capabilities
The paper makes the case for open-world evaluations as a way to measure frontier AI capabilities. It describes the limitations of traditional benchmark-based evaluations and proposes assessing AI systems on real-world tasks instead. The authors also introduce CRUX, a project aimed at conducting these evaluations regularly.
- Benchmark-based evaluations can overstate or understate AI capabilities.
- Open-world evaluations involve long-horizon, messy tasks assessed qualitatively.
- The authors conducted an evaluation where an AI agent developed an iOS application with minimal human intervention.
Opening excerpt (first ~120 words)
arXiv:2605.20520 (cs), submitted on 19 May 2026
Title: Open-World Evaluations for Measuring Frontier AI Capabilities
Authors: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan
Abstract: Benchmark-based evaluation remains important for tracking frontier AI progress.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.