DocEdit Redefined: In-Context Learning for Multimodal Document Editing
Abstract
Structured document generation relies on mapping logical content to its displayed form, which requires the integration of visual, textual, and layout elements. This paper presents a novel approach to structured document editing that leverages the object recognition capabilities of Visual-Language Models (VLMs) to eliminate the dependency on specialized segmentation modules. By integrating state-of-the-art open-world VLMs, we propose a simple in-context learning framework that enhances the flexibility and efficiency of Language-guided Document Editing (DocEdit) tasks. Our method demonstrates promising performance on complex document modifications such as spatial alignment, component merging, and regional grouping, while maintaining the coherence and intent of the original document. Through our proposed few-shot evaluation benchmark suite, we highlight the potential of VLMs for this task. Furthermore, we propose a refined evaluation protocol that incorporates both spatial and semantic reasoning, ensuring a comprehensive assessment of the modified (edited) output. Experimental results underscore the effectiveness of our framework in advancing the capabilities of structured document editing systems.
Related Material
[pdf]
[bibtex]
@InProceedings{Waseem_2025_WACV,
  author    = {Waseem, Muhammad and Biswas, Sanket and Llad\'os, Josep},
  title     = {DocEdit Redefined: In-Context Learning for Multimodal Document Editing},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {February},
  year      = {2025},
  pages     = {1497-1501}
}