LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
The paper introduces LEAPBench, a framework for evaluating how efficiently large language models (LLMs) learn during iterative scientific design. It argues for measuring learning trajectories rather than only final outcomes, and reports that LLMs often do not outperform classical Bayesian baselines. Switching from final-outcome scoring to trajectory scoring can change which model appears best on a given task.
- LEAPBench is a 55-task framework designed to assess learning efficiency in adaptive processes.
- Switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks.
- LLMs do not outperform a classical Bayesian baseline in the evaluated tasks.
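To make the difference between final-outcome and trajectory scoring concrete, here is a minimal sketch. It does not reproduce the paper's metric, which the excerpt does not define. As an assumption, the trajectory score is the mean of the best-so-far curve (a normalized area under the learning curve), and the two runs and their names are hypothetical data made up for illustration.

```python
from itertools import accumulate
from typing import Sequence


def best_so_far(values: Sequence[float]) -> list[float]:
    """Running maximum of the observed objective (higher is better)."""
    if not values:
        raise ValueError("a run needs at least one iteration")
    return list(accumulate(values, max))


def final_outcome_score(values: Sequence[float]) -> float:
    """Snapshot score: best design found by the last iteration."""
    return best_so_far(values)[-1]


def trajectory_score(values: Sequence[float]) -> float:
    """Assumed trajectory metric: mean of the best-so-far curve.

    This rewards reaching good designs early, not only eventually.
    """
    curve = best_so_far(values)
    return sum(curve) / len(curve)


# Hypothetical per-iteration objective values for two optimizers.
runs = {
    "fast_learner": [0.2, 0.7, 0.8, 0.8, 0.8],
    "late_bloomer": [0.1, 0.1, 0.2, 0.3, 0.9],
}

best_final = max(runs, key=lambda name: final_outcome_score(runs[name]))
best_traj = max(runs, key=lambda name: trajectory_score(runs[name]))

print("best by final outcome:", best_final)
print("best by trajectory:   ", best_traj)
```

In this toy data the two scoring rules pick different winners: the run that ends slightly higher wins on the final snapshot, while the run that improves early wins on the trajectory. The summary reports this kind of reversal on 53% of tasks.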
Opening excerpt (first ~120 words)
arXiv:2605.15341 (cs), Computer Science > Machine Learning. Submitted on 14 May 2026.
Title: LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
Authors: Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting
Abstract: LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.