LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
The paper introduces LiveK12Bench, a benchmark designed to evaluate the reasoning abilities of large multimodal models (LMMs) in realistic high school examination scenarios. It highlights the limitations of existing benchmarks and presents a dynamic framework that includes a large set of verified questions from real-world exams. The findings reveal significant performance drops in advanced models when faced with exam-like conditions, indicating a gap between theoretical capabilities and practical educational readiness.
- LiveK12Bench comprises over 2,000 verified questions across various subjects, including Mathematics, Physics, Chemistry, and Biology.
- The benchmark features an automated pipeline that continuously updates the question set from examination papers and mitigates data leakage.
- Experiments show that advanced models such as GPT-5 see their score drop from 79 to 53 when evaluated under realistic exam constraints.
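
To put the reported GPT-5 figures in perspective, the short sketch below computes the absolute and relative size of the drop from 79 to 53. The helper name `score_drop` is hypothetical, and the summary does not state the scoring scale, so the numbers are treated as plain scores rather than as percentages.

```python
def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop as a fraction of `before`)."""
    if before <= 0:
        raise ValueError("before must be positive")
    absolute = before - after
    return absolute, absolute / before


# Scores quoted in the summary above; the scale itself is not specified there.
absolute, relative = score_drop(79, 53)
print(f"absolute drop: {absolute}, relative drop: {relative:.1%}")
# prints "absolute drop: 26, relative drop: 32.9%"
```

Under these assumptions, the model loses 26 points, roughly a third of its original score.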
Opening excerpt (first ~120 words)
arXiv:2605.26781 [cs.AI], submitted on 26 May 2026
Authors: Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li
Abstract: Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.