Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
A recent study evaluates how Large Language Models (LLMs) perform on mathematical reasoning tasks when the problem statements are varied. It compares three methods: chain-of-thought prompting, single-shot code execution, and iterative code execution. All three methods lost some accuracy on the varied problems, and chain-of-thought prompting was the most robust of the three.
- Large Language Models achieve high accuracy on math benchmarks but struggle with variations in problem statements.
- The study tested three approaches on 1,000 problems from the GSM-Symbolic dataset.
- Chain-of-thought prompting was found to be the most robust method, with the smallest accuracy drop.
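To make "accuracy drop under variation" concrete, here is a minimal sketch of how such a measurement could be set up. It is not the paper's code: the template, the name list, the number ranges, and the two toy solvers are all hypothetical stand-ins chosen for illustration, and no LLM is called.

```python
import random
import re

# Hypothetical template; the wording, names, and number ranges are illustrative
# assumptions and are not taken from the GSM-Symbolic dataset.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Ava", "Liam", "Noor", "Mateo"]


def make_variant(rng):
    """Return a (question, answer) pair with a freshly sampled name and numbers."""
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    name = rng.choice(NAMES)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def accuracy(solver, problems):
    """Fraction of (question, answer) pairs the solver answers correctly."""
    correct = sum(1 for question, answer in problems if solver(question) == answer)
    return correct / len(problems)


def accuracy_drop(solver, original, variants):
    """Accuracy on the original problems minus accuracy on the varied ones."""
    return accuracy(solver, original) - accuracy(solver, variants)


def make_memorizer(problems):
    """Toy solver that only knows answers to questions it has already seen."""
    seen = dict(problems)
    return lambda question: seen.get(question)


def adding_solver(question):
    """Toy solver that reads the numbers from the text and adds them."""
    return sum(int(token) for token in re.findall(r"\d+", question))


if __name__ == "__main__":
    rng = random.Random(0)
    original = [make_variant(rng) for _ in range(20)]
    variants = [make_variant(rng) for _ in range(20)]
    print("memorizer drop:", accuracy_drop(make_memorizer(original), original, variants))
    print("adding solver drop:", accuracy_drop(adding_solver, original, variants))
```

The memorizing solver stands in for a method that depends on surface form, while the adding solver stands in for one that works from the structure of the problem; comparing their drops mirrors, in miniature, the kind of comparison the study makes across its three methods.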
Opening excerpt (first ~120 words):
arXiv:2605.26414 (cs), Computer Science > Artificial Intelligence. Submitted on 26 May 2026. Author: Matthew Kutakh.
Abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.