GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
GlobalDentBench is presented as the first multinational benchmark for evaluating the clinical reasoning of large language models (LLMs) in dentistry. It comprises 8,978 expert-validated questions in multiple formats and assesses reasoning at increasing levels of complexity. The findings show that LLM performance degrades sharply as reasoning complexity increases, and they raise critical safety concerns about LLM-generated clinical recommendations.
- GlobalDentBench features a taxonomy covering 14 dental specialties across 88 countries and regions.
- The benchmark assesses three reasoning levels: knowledge recall, routine reasoning, and individualized reasoning.
- Evaluation of 12 LLMs showed accuracy dropped significantly from 81.34% on multiple-choice questions to 22.34% on case-based questions.
- An alarming 31.01% of LLM-generated clinical recommendations were found to be unsafe, with some posing risks of irreversible patient harm.
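To put the reported accuracy figures in perspective, the gap between the two question formats can be expressed both in percentage points and as a relative decline. The short sketch below does that arithmetic using only the two accuracy values quoted above; the function and variable names are illustrative and do not come from the paper.

```python
# Accuracy figures quoted in the summary above (in percent).
mcq_accuracy = 81.34   # multiple-choice questions
case_accuracy = 22.34  # case-based questions


def accuracy_drop(before, after):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


absolute_drop, relative_drop = accuracy_drop(mcq_accuracy, case_accuracy)
print(f"Absolute drop: {absolute_drop:.2f} percentage points")
print(f"Relative drop: {relative_drop:.1f}%")
```

Under these figures the gap is 59.00 percentage points, or roughly a 72.5% relative decline from the multiple-choice accuracy.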
arXiv:2605.24636 (cs.AI), submitted on 23 May 2026.
Authors: Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang