Advances in Question Answering on Visually Rich Documents

Saturday 01 March 2025


The field of question answering on visually rich documents has made significant progress in recent years, driven by the development of large language models (LLMs) and their growing ability to process complex visual information.


One of the key challenges in this field is enabling LLMs to understand a document's layout and structure as well as the text it contains, which requires jointly modeling the linguistic and visual aspects of the page.
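The survey does not commit to a single recipe, but one common way to make layout visible to a transformer, popularized by LayoutLM-style models, is to add embeddings of each token's bounding-box coordinates to its text embedding. Here is a minimal PyTorch sketch; all dimensions, names, and the coordinate grid are illustrative assumptions, not a specific model from the paper:

```python
import torch
import torch.nn as nn

class TextLayoutEmbedding(nn.Module):
    """Sum token embeddings with 2D position (bounding-box) embeddings,
    LayoutLM-style. All sizes here are illustrative, not from the survey."""
    def __init__(self, vocab_size=30522, hidden=256, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        # Shared embeddings for x and y coordinates, each quantized
        # to a 0..max_coord-1 grid over the page.
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) as (x0, y0, x1, y1)
        text = self.tok(token_ids)
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                  + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        return text + layout

emb = TextLayoutEmbedding()
ids = torch.randint(0, 30522, (1, 8))
boxes = torch.randint(0, 1024, (1, 8, 4))
print(emb(ids, boxes).shape)  # torch.Size([1, 8, 256])
```

Because the layout signal is just added to the text embedding, the rest of the transformer stack can stay unchanged, which is part of why this design has proved popular.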


To address this challenge, researchers have developed new methods for integrating visual and textual information, such as attention mechanisms that focus on relevant parts of the document, or graph neural networks that model the relationships between document elements.
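As a concrete illustration of the attention idea, the sketch below lets a question vector attend over pre-encoded document regions (OCR lines, table cells, and so on), so the attention weights indicate which parts of the page the model focuses on when answering. Shapes and names are invented for the example:

```python
import torch
import torch.nn as nn

# Illustrative shapes: one question vector attending over n_regions
# document regions, each already encoded as a 256-dim vector.
hidden, n_regions = 256, 40
attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=4, batch_first=True)

question = torch.randn(1, 1, hidden)         # (batch, 1 query, hidden)
regions = torch.randn(1, n_regions, hidden)  # (batch, regions, hidden)

# The question is the query; the regions supply keys and values. The
# returned weights show how strongly each region contributed.
fused, weights = attn(question, regions, regions)
print(fused.shape, weights.shape)  # (1, 1, 256) (1, 1, 40)
```

A graph neural network variant would instead connect region vectors along edges such as "same row" or "same table" and pass messages over that graph; the fused representation is then fed to the answer decoder.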


These advances have enabled LLMs to be trained on large datasets of visually rich documents, such as PDFs and scanned images, and to perform well on a range of question answering tasks, including extracting information from tables and charts.
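To see what such a system looks like in use, the Hugging Face transformers library ships a document-question-answering pipeline that combines OCR with a layout-aware model. A usage sketch, assuming transformers, Pillow, and pytesseract (plus the Tesseract binary) are installed; the checkpoint is one publicly available example and the file name is a placeholder:

```python
from transformers import pipeline
from PIL import Image

# "impira/layoutlm-document-qa" is one publicly available checkpoint;
# any document-question-answering model on the Hub could be substituted.
qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

image = Image.open("invoice.png")  # a scanned page, or a PDF page rendered to an image
answers = qa(image=image, question="What is the invoice total?")
print(answers)  # e.g. [{"score": ..., "answer": ..., "start": ..., "end": ...}]
```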


The survey also highlights the importance of evaluating these models on real-world datasets rather than only on synthetic data. Model performance can vary significantly with the specific data used for training and evaluation, and real-world documents exhibit a range of visual and textual styles that synthetic data rarely captures.


Overall, the survey provides a comprehensive overview of the current state of question answering on visually rich documents, and highlights the potential for these models to have a significant impact on a wide range of applications, from information retrieval to document summarization.


Cite this article: “Advances in Question Answering on Visually Rich Documents”, The Science Archive, 2025.


Large Language Models, Visually Rich Documents, Question Answering, Attention Mechanisms, Graph Neural Networks, PDFs, Scanned Images, Tables, Charts, Real-World Datasets, Synthetic Data


Reference: Camille Barboule, Benjamin Piwowarski, Yoan Chabot, “Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends” (2025).
