Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

May 27, 2026 · 4:00 AM UTC · 3 min read

#artificial intelligence #machine learning #mathematics

⚡ TL;DR · AI summary

A recent study evaluates how Large Language Models (LLMs) perform on mathematical reasoning tasks when faced with variations in problem statements. The research compares three methods: chain-of-thought prompting, single-shot code execution, and iterative code execution. Results indicate that while all methods showed some accuracy drop, chain-of-thought prompting was the most robust against variations.

Key facts

▪ Large Language Models achieve high accuracy on math benchmarks but struggle with variations in problem statements.
▪ The study tested three approaches on 1,000 problems from the GSM-Symbolic dataset.
▪ Chain-of-thought prompting was found to be the most robust method, with the smallest accuracy drop.
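
To make "accuracy drop" concrete, here is a minimal sketch of how such a figure could be computed for one method: score the answers on the original problems, score the answers on the varied problems, and take the difference in percentage points. The function names and the toy answer lists are illustrative assumptions, not the study's code or its results, and the paper may define the metric differently.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of problems whose predicted answer matches exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one problem")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def accuracy_drop(original_acc: float, variant_acc: float) -> float:
    """Drop from original to varied problems, in percentage points."""
    return (original_acc - variant_acc) * 100


# Toy, made-up answers for illustration only (NOT data from the study).
original_expected = [12, 7, 30, 45]
original_predicted = [12, 7, 30, 45]   # all four correct
variant_expected = [15, 9, 28, 50]     # same problems with changed numbers
variant_predicted = [15, 9, 27, 50]    # one mistake on the varied set

original_acc = accuracy(original_predicted, original_expected)
variant_acc = accuracy(variant_predicted, variant_expected)
drop = accuracy_drop(original_acc, variant_acc)

print(f"original: {original_acc:.2f}, variants: {variant_acc:.2f}, "
      f"drop: {drop:.1f} percentage points")
```

Running the same calculation once per method (chain-of-thought, single-shot code execution, iterative code execution) would give the per-method drops that a robustness comparison of this kind relies on.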

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

arXiv:2605.26414 (cs) [Submitted on 26 May 2026]
Title: Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
Authors: Matthew Kutakh
Abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
