PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
The article introduces PRISM, a benchmark for evaluating programmatic spatial-temporal reasoning in code-driven video generation. It comprises more than 10,000 instruction-code pairs and targets the difficulty of assessing whether language-model outputs are spatially coherent. The findings show a large gap between code execution success and spatial correctness, which underscores the need for evaluation metrics that go beyond whether the code runs.
- PRISM includes 10,372 human-calibrated instruction-code pairs, making it 20 times larger than previous benchmarks.
- The evaluation framework features four metrics: Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity, and Temporal Density.
- A study of seven mainstream language models revealed a 41% average drop from execution success to spatial pass rate.
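To make the last point concrete, here is a minimal sketch of how an average drop from execution success to spatial pass rate could be computed across models. The summary does not say whether the 41% figure is an absolute (percentage-point) or relative drop, so the sketch reports both. The function name `exec_to_spatial_gap`, the input layout, and the numbers are illustrative assumptions, not the paper's method or data.

```python
from statistics import mean


def exec_to_spatial_gap(results):
    """Average drop from execution success to spatial pass rate.

    `results` maps a model name to a pair
    (execution_success_rate, spatial_pass_rate), both in [0, 1].
    Returns (mean absolute drop, mean relative drop).
    This layout is an assumption made for illustration.
    """
    pairs = list(results.values())
    # Absolute drop: difference in rates, per model.
    absolute = [e - s for e, s in pairs]
    # Relative drop: share of executable outputs that fail the spatial check.
    relative = [(e - s) / e for e, s in pairs if e > 0]
    return mean(absolute), mean(relative)


# Made-up placeholder values, NOT results reported for PRISM.
toy_results = {
    "model_a": (0.90, 0.45),
    "model_b": (0.80, 0.60),
}

abs_gap, rel_gap = exec_to_spatial_gap(toy_results)
print(f"mean absolute drop: {abs_gap:.3f}")
print(f"mean relative drop: {rel_gap:.3f}")
```

Reporting both values avoids committing to one reading of "drop": a model can execute almost all of its code and still lose a large share of those outputs at the spatial check.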
Opening excerpt (first ~120 words)
arXiv:2605.19382 (cs), Computer Science > Artificial Intelligence
Submitted on 19 May 2026
Title: PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
Authors: Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu, Jingru Fan, Shu Yao, Jie Zhang, Tianle Zhou, Huatao Li, Ruijie Shi, Yihan Li, Chen Qian
Abstract: Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.