DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization

Phan Phuong Mai Chau, Souhail Bakkali, Antoine Doucet; Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, 2025, pp. 1303-1312

Abstract


Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. Leveraging pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.

Related Material


[bibtex]
@InProceedings{Chau_2025_WACV,
    author    = {Chau, Phan Phuong Mai and Bakkali, Souhail and Doucet, Antoine},
    title     = {DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {1303-1312}
}