Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
The paper examines the limitations of Multi-Modal Large Language Models (MLLMs) in spatial reasoning, particularly under perceptual constraints. It introduces an approach called the Epistemic Sensory Bottleneck, which improves MLLMs' ability to infer beliefs in multi-agent environments. The findings indicate that current models struggle with spatial symmetry, while the proposed method significantly improves performance on spatial reasoning tasks.
- Multi-Modal Large Language Models (MLLMs) face challenges in embodied spatial intelligence due to a reliance on text-based probability distributions.
- The study introduces an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought to improve spatial reasoning in MLLMs.
- Current MLLMs achieve a zero-shot accuracy baseline of 42% in spatial tasks, while the proposed method outperforms existing baselines.
arXiv:2605.18194 [cs.AI], submitted on 18 May 2026. Authors: Yajing Zhou, Xiangyu Kong. Abstract: While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.