@InProceedings{Chau_2025_WACV,
    author    = {Chau, Phan Phuong Mai and Bakkali, Souhail and Doucet, Antoine},
    title     = {DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {1303-1312}
}
DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization
Abstract
Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. By pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.