Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models
A new study explores how generative Vision-Language Models (VLMs) transform visual inputs into text. The authors propose a function-centric framework using Transcoders to better understand the computational pathways linking images to text generation. Their findings indicate that this approach yields more interpretable and predictive insights into multimodal computation.
- Generative Vision-Language Models perform well on multimodal reasoning, but how they transform visual inputs into text remains poorly understood.
- The study introduces a function-centric framework based on Transcoders to analyze the computational pathways of VLMs.
- Transcoder attributions show stronger effects on visually grounded tokens than attributions from Sparse Autoencoders.
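To make the contrast between the two tools concrete, here is a minimal sketch of how a Sparse Autoencoder and a Transcoder differ in their training target. It follows the general definition used in the interpretability literature, not the paper's own setup, which the excerpt does not describe. The toy MLP, the layer sizes, the L1 coefficient, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 8, 32  # hypothetical sizes, not taken from the paper


def relu(x):
    return np.maximum(x, 0.0)


def init_params(d_in, d_hidden, d_out, rng):
    """Random weights for a one-hidden-layer sparse dictionary model."""
    return {
        "W_enc": rng.normal(0.0, 0.1, (d_in, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, 0.1, (d_hidden, d_out)),
        "b_dec": np.zeros(d_out),
    }


def sparse_forward(params, x):
    """Encode x into non-negative latents z, then decode to a prediction."""
    z = relu(x @ params["W_enc"] + params["b_enc"])
    y_hat = z @ params["W_dec"] + params["b_dec"]
    return z, y_hat


def sparse_loss(params, x, target, l1_coeff=1e-3):
    """Mean squared error against `target` plus an L1 sparsity penalty on z."""
    z, y_hat = sparse_forward(params, x)
    mse = np.mean((y_hat - target) ** 2)
    sparsity = np.mean(np.abs(z).sum(axis=-1))
    return mse + l1_coeff * sparsity


# Toy stand-in for one MLP block of a model (assumption: a 2-layer ReLU MLP).
W1 = rng.normal(0.0, 0.5, (d_model, 4 * d_model))
W2 = rng.normal(0.0, 0.5, (4 * d_model, d_model))


def toy_mlp(x):
    return relu(x @ W1) @ W2


mlp_in = rng.normal(size=(16, d_model))  # activations entering the MLP
mlp_out = toy_mlp(mlp_in)                # activations leaving the MLP

sae_params = init_params(d_model, d_latent, d_model, rng)
tc_params = init_params(d_model, d_latent, d_model, rng)

# Sparse Autoencoder: input and target are the SAME activation.
sae_loss = sparse_loss(sae_params, mlp_out, mlp_out)

# Transcoder: input is the MLP input, target is the MLP output,
# so the sparse latents approximate the function the MLP computes.
tc_loss = sparse_loss(tc_params, mlp_in, mlp_out)

z, _ = sparse_forward(tc_params, mlp_in)
```

The only difference between the two losses is the `(x, target)` pair. Because the transcoder maps input to output, its latents describe what a layer computes rather than what an activation contains, which is the sense in which the summary calls the framework function-centric.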
arXiv:2605.22902 (cs), Computer Science > Machine Learning. Submitted on 21 May 2026.
Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos
Abstract: Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.