GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
The article introduces GTBench, a benchmark for evaluating large language models (LLMs) as mathematical research assistants in graph theory. It consists of 63 problems categorized by difficulty, ranging from basic definitions to complex proof construction. The study evaluates five advanced models and reports significant performance differences among them, with implications for the use of AI in mathematics education.
- GTBench includes problems sourced from verified academic materials, organized into three difficulty groups.
- The evaluation of five models shows that GPT-5 performs best, especially in basic tasks and graduate proofs.
- Failure mode analysis indicates that common errors include incorrect algorithm execution and incomplete reasoning.
Opening excerpt (first ~120 words)
arXiv:2606.03144 (cs), Computer Science > Artificial Intelligence. Submitted on 2 Jun 2026.
Authors: Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta
Abstract: Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.