MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
MedCUA-Bench is a new benchmark built specifically for evaluating clinical computer-use agents. It targets the lack of reliable testing environments for medical software, covering 18 clinical scenarios across 10 medical domains. Its results reveal substantial performance gaps when current agents operate real clinical interfaces.
- MedCUA-Bench focuses on automating repetitive screen-based clinical work.
- It includes 18 clinical scenarios reconstructed from real product manuals.
- The best closed-source model achieved only 54.2% success on the benchmark.
Opening excerpt (first ~120 words)
arXiv:2606.03203 [cs.AI], submitted on 2 Jun 2026
Title: MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Authors: Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang
Abstract: Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated.
…
Excerpt limited to ~120 words for fair-use compliance. The full paper is available on arXiv (cs.AI).