ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
The paper introduces ArchSIBench, a benchmark designed to evaluate the architectural spatial intelligence of Vision-Language Models (VLMs). It focuses on higher-level cognitive tasks related to architectural space, which have been largely overlooked in previous research. The findings indicate significant performance gaps between VLMs and human evaluators, particularly in spatial reasoning tasks.
- ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration.
- The benchmark includes 3,000 question-answer pairs for comprehensive evaluation.
- Most VLMs show significant performance gaps in architectural spatial intelligence relative to human baselines.
Opening excerpt (first ~120 words)
arXiv:2605.20837 (cs), Computer Science > Computer Vision and Pattern Recognition. Submitted on 20 May 2026.
Authors: Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang
Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.