SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
SAVER is a framework for multimodal information extraction from social media posts. Because the images attached to a post can be weakly related to the text, redundant, or even misleading, SAVER consults visual evidence selectively instead of fusing it into every prediction. The authors report higher F1 scores at lower computational cost than always-on multimodal fusion.
- SAVER is a selective vision-as-needed framework for multimodal named entity recognition and relation extraction.
- The framework uses a Conformal Groundability Gate to estimate visual groundability and calibrate activation thresholds.
- Experiments show that SAVER consistently improves F1 scores and reduces computational costs compared to text-only and always-on multimodal baselines.
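The excerpt does not describe how the Conformal Groundability Gate is built, so the following is only a minimal sketch of the general idea of calibrating an activation threshold with split-conformal-style quantiles. It is not the authors' method. The function names (`calibrate_threshold`, `use_vision`), the `needs_vision` labels, and the assumption that a groundability score lies in [0, 1] are all hypothetical.

```python
import math


def calibrate_threshold(scores, needs_vision, alpha=0.1):
    """Pick a gate threshold from a held-out calibration set (hypothetical sketch).

    scores       -- groundability scores in [0, 1], one per calibration example
    needs_vision -- truthy where the image actually helped extraction
    alpha        -- tolerated miss rate on examples that need vision

    Uses the usual split-conformal quantile on the nonconformity 1 - score,
    computed only over examples that need vision.
    """
    nonconformity = sorted(1.0 - s for s, y in zip(scores, needs_vision) if y)
    n = len(nonconformity)
    if n == 0:
        raise ValueError("calibration set has no vision-needed examples")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: always consult vision.
        return 0.0
    return 1.0 - nonconformity[k - 1]


def use_vision(score, threshold):
    """Gate decision: consult the image only when the score clears the threshold."""
    return score >= threshold
```

With a threshold calibrated this way, posts whose score falls below it would be handled by the text-only path, which is where the compute savings in a selective design would come from. A smaller `alpha` lowers the threshold and consults images more often.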
Opening excerpt (first ~120 words)
arXiv:2605.20713 (cs), Computer Science > Computer Vision and Pattern Recognition
[Submitted on 20 May 2026]
Title: SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
Authors: Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao
Abstract: Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.