LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

May 27, 2026 · 4:00 AM UTC · 3 min read

#artificial intelligence #education #machine learning

TL;DR · WeSearch summary

The paper introduces LiveK12Bench, a benchmark designed to evaluate the reasoning abilities of large multimodal models (LMMs) in realistic high school examination scenarios. It highlights the limitations of existing benchmarks and presents a dynamic framework that includes a large set of verified questions from real-world exams. The findings reveal significant performance drops in advanced models when faced with exam-like conditions, indicating a gap between theoretical capabilities and practical educational readiness.

Key facts

▪ LiveK12Bench comprises over 2,000 verified questions across various subjects, including Mathematics, Physics, Chemistry, and Biology.
▪ The benchmark features an automated pipeline that continuously updates it from examination papers, which mitigates data leakage.
▪ Experiments show that advanced models such as GPT-5 see their score drop from 79 to 53 when evaluated under realistic exam constraints.

Original article

arXiv cs.AI

Opening excerpt (first ~120 words)

arXiv:2605.26781 (cs) · Submitted on 26 May 2026
Title: LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Authors: Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

Abstract: Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
