Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
A recent study questions whether vision-language benchmarks truly assess visual understanding in models. The research indicates that current benchmarks may not adequately measure reliance on visual evidence: removing a substantial fraction of image tokens only minimally affects model performance on a hallucination benchmark. This suggests a need for improved methods to assess fine-grained visual grounding in vision-language models.
- The study investigates the relationship between benchmark accuracy and grounded visual understanding in vision-language models.
- Findings reveal that removing a substantial fraction of image tokens minimally impacts model performance on a hallucination benchmark.
- The research highlights that current benchmarks may not reliably evaluate fine-grained visual grounding in vision-language models.
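The token-removal finding can be illustrated with a minimal sketch. The summary does not say how the authors select which image tokens to remove, so this example assumes uniform random dropping; `drop_image_tokens`, `accuracy_under_ablation`, and the `predict` callable are hypothetical names introduced for illustration and are not taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple


def drop_image_tokens(tokens: Sequence, drop_fraction: float, seed: int = 0) -> List:
    """Remove a random fraction of image tokens, keeping the original order.

    Assumption: tokens are dropped uniformly at random. The paper may use
    a different selection rule.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    n_keep = len(tokens) - int(round(drop_fraction * len(tokens)))
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


def accuracy_under_ablation(
    predict: Callable[[List, str], str],
    examples: Sequence[Tuple[Sequence, str, str]],
    drop_fraction: float,
    seed: int = 0,
) -> float:
    """Accuracy of `predict` when each example loses `drop_fraction` of its image tokens.

    Each example is (image_tokens, question, gold_answer). `predict` is a
    hypothetical stand-in for a vision-language model.
    """
    correct = 0
    for image_tokens, question, answer in examples:
        kept = drop_image_tokens(image_tokens, drop_fraction, seed)
        correct += int(predict(kept, question) == answer)
    return correct / len(examples)
```

Comparing `accuracy_under_ablation(predict, examples, 0.0)` with the same call at a larger `drop_fraction` gives the kind of gap the study examines: if the two scores are nearly equal, the benchmark is not forcing the model to rely on the image.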
Opening excerpt (first ~120 words)
Computer Science > Computer Vision and Pattern Recognition
arXiv:2605.22903 (cs)
Submitted on 21 May 2026
Title: Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
Authors: Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou
Abstract: Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.