VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
The paper examines why vision-language models (VLMs) fail at visual path tracing. Despite strong performance on multimodal benchmarks, these models often switch paths when similar nearby distractors compete locally with the target path. The authors report that traditional solutions do not effectively address these path-switching failures in complex visual scenes.
- Vision-language models (VLMs) achieve strong performance on multimodal benchmarks but may still lack robust control over basic visual operations.
- The study focuses on line tracing, where a model must follow a selected visual path through successive local continuations while nearby paths compete with it.
- Path-following failures are attributed to local competition from similar distractors.
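To make the task concrete, the following is a minimal toy sketch of line tracing as a sequence of local continuation choices. It is not the paper's benchmark or method: the grid, the `trace` function, and the `prefer_straight` flag are all hypothetical names introduced here for illustration. Two lines cross at one cell; a tracer that keeps track of its heading stays on the selected line, while one that simply takes the first available neighbor switches to the competing line at the crossing.

```python
from typing import List, Tuple

Cell = Tuple[int, int]

# Hypothetical toy image: '#' marks line pixels.
# A horizontal line (row 2) and a vertical line (column 3) cross at (2, 3).
GRID = [
    "...#...",
    "...#...",
    "#######",
    "...#...",
    "...#...",
]


def trace(grid: List[str], start: Cell, heading: Cell, prefer_straight: bool) -> List[Cell]:
    """Follow '#' cells from `start`, one local continuation at a time."""
    rows, cols = len(grid), len(grid[0])
    path = [start]
    visited = {start}
    (r, c), (dr, dc) = start, heading
    while True:
        if prefer_straight:
            # Try the current heading first, then the two perpendicular turns.
            moves = [(dr, dc), (dc, dr), (-dc, -dr)]
        else:
            # Fixed scan order (up, down, left, right) that ignores the heading.
            moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        for mr, mc in moves:
            nr, nc = r + mr, c + mc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "#" and (nr, nc) not in visited:
                r, c, dr, dc = nr, nc, mr, mc
                path.append((r, c))
                visited.add((r, c))
                break
        else:
            # No unvisited line pixel is adjacent: the trace ends here.
            return path


tracked = trace(GRID, (2, 0), (0, 1), prefer_straight=True)
untracked = trace(GRID, (2, 0), (0, 1), prefer_straight=False)
print(tracked[-1])    # (2, 6): stays on the horizontal line
print(untracked[-1])  # (0, 3): switches to the vertical line at the crossing
```

Both tracers make only local decisions. The difference is whether the choice at each step is conditioned on the path followed so far, which is the kind of distinction the title "trace without tracking" points at.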
Opening excerpt (first ~120 words)
Subject: Computer Science > Computer Vision and Pattern Recognition
arXiv:2605.15672 (cs), submitted on 15 May 2026
Title: VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Authors: Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No
Abstract: Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.