M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
M3DocDep is a method for processing long, multi-page documents with large vision-language models. It recovers block-level dependencies before forming retrieval units, so that chunk boundaries reflect the document's structure. The reported results show gains in retrieval and answer-quality metrics over existing methods.
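To make the idea of dependency-aware chunking concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Block` type, its `parent` field, and `chunk_by_dependency` are hypothetical names, and the dependency links are assumed to be already available (in M3DocDep they would presumably come from the vision-language model). The sketch only shows how such links could keep related blocks in one retrieval unit, even across a page break.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    block_id: int
    page: int
    text: str
    # Hypothetical dependency link: the block_id this block depends on,
    # or None if the block starts a new unit (e.g. a section heading).
    parent: Optional[int] = None


def chunk_by_dependency(blocks):
    """Group blocks that share the same dependency root into one chunk."""
    by_id = {b.block_id: b for b in blocks}

    def root_of(block):
        # Follow parent links until a block without a parent is reached.
        # The `seen` set guards against accidental cycles in the links.
        seen = set()
        while block.parent is not None and block.block_id not in seen:
            seen.add(block.block_id)
            block = by_id[block.parent]
        return block.block_id

    chunks = {}
    for b in blocks:  # input order is assumed to be reading order
        chunks.setdefault(root_of(b), []).append(b)
    return list(chunks.values())


blocks = [
    Block(0, 1, "1. Safety procedures"),
    Block(1, 1, "Before starting the pump,", parent=0),
    Block(2, 2, "close valve A.", parent=1),  # continues across the page break
    Block(3, 2, "2. Maintenance"),
    Block(4, 2, "Inspect seals monthly.", parent=3),
]
chunks = chunk_by_dependency(blocks)
for chunk in chunks:
    print([b.block_id for b in chunk], sorted({b.page for b in chunk}))
```

A page-based splitter would cut between blocks 1 and 2 here. Grouping by dependency root keeps the sentence with its heading and separates it from the next section, even though both appear on page 2.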
- M3DocDep is designed to enhance retrieval-augmented generation in long, multi-page industrial documents.
- The method addresses issues with existing chunkers that fail to capture cross-page relationships and other structural cues.
- M3DocDep shows improvements in various benchmarks, including a 28.5 to 39.6 percent increase in STEDS and a 1.1 to 15.3 percent increase in retrieval nDCG.
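For readers unfamiliar with the retrieval metric mentioned above, the following is a small sketch of normalized discounted cumulative gain (nDCG) in Python. It uses the common linear-gain form with a log2 rank discount. The summary does not say which variant or cutoff the paper uses, so the function names and the optional `k` parameter are assumptions for illustration. The ideal ranking is also assumed to be derivable from the same list of graded relevance labels.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """nDCG of a ranked list of graded relevance labels, optionally cut at k."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    # With no relevant items the ideal DCG is zero; return 0.0 by convention.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])  # already in ideal order
swapped = ndcg([0, 1])     # the only relevant item is ranked second
print(perfect, round(swapped, 4))
```

A score of 1.0 means the retrieved chunks are in the best possible order. Pushing a relevant chunk down the ranking lowers the score, which is why better chunk boundaries can show up as nDCG gains.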
Opening excerpt (first ~120 words)
arXiv:2605.18774 (cs), Computer Science > Information Retrieval
Submitted on 17 Apr 2026
Title: M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Authors: Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Abstract: In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.