QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.
Full article excerpt tap to expand
Computer Science > Artificial Intelligence arXiv:2604.24021 (cs) [Submitted on 27 Apr 2026] Title:QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems Authors:Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang View a PDF of the paper titled QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems, by Chenyang An and 3 other authors View PDF HTML (experimental) Abstract:We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at this https URL. Subjects: Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP) Cite as: arXiv:2604.24021 [cs.AI] (or arXiv:2604.24021v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.24021 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Chenyang An [view email] [v1] Mon, 27 Apr 2026 04:10:02 UTC (62 KB) Full-text links: Access Paper: View a PDF of the paper titled QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems, by Chenyang An and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI < prev | next > new | recent | 2026-04 Change to browse by: cs math math.AP References & Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading... BibTeX formatted citation × loading... Data provided by: Bookmark Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue…
This excerpt is published under fair use for community discussion. Read the full article at arXiv.org.