QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
Researchers have introduced QSTRBench, a new benchmark designed to evaluate how well language models reason with qualitative spatial and temporal calculi. The benchmark includes a variety of reasoning tasks across different calculi. Results show that models perform better than random guessing, but none answers all questions correctly on a consistent basis. The study also highlights significant variation in performance depending on the complexity of the calculus used.
- QSTRBench evaluates large language models' reasoning in qualitative spatial and temporal contexts.
- The benchmark includes various reasoning tasks across multiple calculi, such as Point Algebra and Allen's Interval Algebra.
- Results show that while models outperform random guessing, none can consistently answer all questions correctly.
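To make the idea of a qualitative calculus concrete, here is a minimal sketch of relation composition in Point Algebra, whose three base relations between time points are `<`, `=` and `>`. Composition answers questions such as "if A is before B and B is before C, how can A relate to C?". The function names `compose_base` and `compose` are hypothetical, and whether QSTRBench's tasks take exactly this form is an assumption, not something stated in the summary above.

```python
from itertools import product

# The three base relations of Point Algebra.
BASE = frozenset({"<", "=", ">"})


def compose_base(r, s):
    """Possible relations between A and C, given A r B and B s C."""
    if r == "=":
        return frozenset({s})   # A equals B, so A relates to C as B does
    if s == "=":
        return frozenset({r})   # B equals C, so A relates to C as it does to B
    if r == s:
        return frozenset({r})   # "<" then "<" gives "<"; likewise for ">"
    return BASE                 # "<" then ">" (or the reverse) tells us nothing


def compose(R, S):
    """Compose two possibly disjunctive relations, given as sets of base relations."""
    result = set()
    for r, s in product(R, S):
        result |= compose_base(r, s)
    return frozenset(result)


if __name__ == "__main__":
    print(sorted(compose({"<"}, {"<"})))   # A < B and B < C
    print(sorted(compose({"<"}, {">"})))   # A < B and B > C
```

Allen's Interval Algebra follows the same pattern with 13 base relations between intervals instead of 3 between points, so its composition table is correspondingly larger.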
Opening excerpt (first ~120 words)
arXiv:2605.18380 (cs.AI), submitted on 18 May 2026
Title: QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
Authors: Anthony G. Cohn, Robert E. Blackwell
Abstract: We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs).
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.