LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
LinAlg-Bench is a new diagnostic benchmark that evaluates large language models on linear algebra computations. It tests models on 3x3, 4x4, and 5x5 matrices and identifies structural failure modes in their mathematical reasoning. The findings show model behavior shifting as matrix size increases, which exposes limits of current AI systems on structured computation.
- LinAlg-Bench evaluates 10 large language models on structured linear algebra computations involving 3x3, 4x4, and 5x5 matrices.
- The benchmark includes 660 SymPy-certified problems and classifies 1,156 failures into ten primary error tags.
- A key finding is a behavioral threshold at the 4x4 matrix scale, where models transition from execution errors to computational abandonment.
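To make "SymPy-certified" concrete, here is a minimal sketch of how a determinant problem could be paired with an exact reference answer. It is an illustration under stated assumptions, not the authors' pipeline: the helper names (`certified_det`, `make_problem`, `is_correct`), the entry range, and the seeding scheme are hypothetical, and the determinant is only one example of a task (the source does not list the 9 task types in this excerpt).

```python
import random
from sympy import Matrix

def certified_det(entries):
    """Return the exact integer determinant of an integer matrix using SymPy."""
    # SymPy works in exact arithmetic, so there is no floating-point rounding.
    return int(Matrix(entries).det())

def make_problem(n, seed, low=-9, high=9):
    """Build a hypothetical n x n determinant problem with a reference answer.

    The entry range [low, high] and the seed are assumptions for illustration.
    """
    rng = random.Random(seed)
    entries = [[rng.randint(low, high) for _ in range(n)] for _ in range(n)]
    return {"size": n, "matrix": entries, "answer": certified_det(entries)}

def is_correct(problem, model_answer):
    """Exact-match grading of a model's integer answer against the reference."""
    return model_answer == problem["answer"]

# One problem per matrix size named in the summary (3x3, 4x4, 5x5).
problems = [make_problem(n, seed=n) for n in (3, 4, 5)]

# The summary's counts are consistent: 10 models x 660 problems = 6,600 outputs.
total_outputs = 10 * 660
```

Grading against an exact symbolic result means an answer is either right or wrong, which is what lets each failure be inspected and tagged afterwards rather than judged by a numeric tolerance.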
Opening excerpt (first ~120 words)
arXiv:2605.16675 (cs), Computer Science > Artificial Intelligence. Submitted on 15 May 2026.
Title: LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Authors: Shradha Agarwal, Deepak Rajbhar, Tariq J
Abstract: We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.