CoD: Coherent Detection of Entities From Images With Multiple Modalities

Verma, Vinay; Sanny, Dween; Singh, Abhishek; Gupta, Deepak

Vinay Verma, Dween Sanny, Abhishek Singh, Deepak Gupta; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 8015-8024

Abstract

Object detection in computer vision has traditionally been treated as a unimodal problem in the literature. In real-world scenarios, however, multiple sources of data in different modalities are often present, which poses a significant challenge: it becomes difficult to accurately define object boundaries for various products or information. For instance, while extracting information from a document, it may be necessary to use both visual information (e.g., image/object) and textual information from OCR to detect and classify information associated with objects, such as text blocks, tables, and figures. If visual and textual information pertain to the same object, the model should detect a single bounding box that encloses all of that object's multi-modal information. This work presents a novel, end-to-end approach that employs transformers to automate object boundary identification in multi-modal scenarios. The proposed model takes multi-scale image features, OCR-based text extraction, and 2D position embeddings of words as input, which interact through self- and cross-attention mechanisms. Additionally, the study proposes a domain adaptation model to address the often significant domain gap between training and test samples in such scenarios. The proposed approach shows significant improvements of 27.2%, 5.0%, and 1.7% in the hard-negative-sample, multi-modal, and domain-shift scenarios, respectively. The ablation studies confirm the effectiveness of the proposed components.
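As a rough illustration of how image features, OCR word embeddings, and 2D position embeddings might interact through self- and cross-attention, the following minimal sketch uses plain Python. It is not the paper's implementation: the function names (`box_position_embedding`, `fuse`), the sinusoidal box encoding, the absence of learned projections, and the choice of image features as queries over text keys and values are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def box_position_embedding(box, dim):
    """Hypothetical 2D position embedding: a sinusoidal encoding of a
    normalized word box (x0, y0, x1, y1). Assumes dim is divisible by 4."""
    per_coord = dim // 4
    emb = []
    for c in box:
        for i in range(per_coord):
            freq = 1.0 / (10000 ** (2 * (i // 2) / per_coord))
            emb.append(math.sin(c * freq) if i % 2 == 0 else math.cos(c * freq))
    return emb

def fuse(image_feats, word_embs, word_boxes):
    """Assumed fusion order: add position embeddings to word embeddings,
    apply self-attention over words, then let image features attend to the
    text tokens (cross-attention)."""
    dim = len(word_embs[0])
    text = [[w + p for w, p in zip(emb, box_position_embedding(box, dim))]
            for emb, box in zip(word_embs, word_boxes)]
    text = attention(text, text, text)          # self-attention among words
    return attention(image_feats, text, text)   # cross-attention: image -> text

# Toy inputs: three image feature vectors (e.g. flattened from several scales)
# and two OCR words with normalized bounding boxes.
image_feats = [[0.1] * 8, [0.2] * 8, [0.3] * 8]
word_embs = [[0.5] * 8, [-0.5] * 8]
word_boxes = [(0.1, 0.1, 0.4, 0.2), (0.5, 0.1, 0.9, 0.2)]
fused = fuse(image_feats, word_embs, word_boxes)
print(len(fused), len(fused[0]))
```

In this sketch each image feature ends up as a weighted mixture of position-aware text tokens; a real detector would add learned query, key, and value projections, multiple heads, and a box-prediction head on top.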

Related Material

@InProceedings{Verma_2024_WACV,
    author    = {Verma, Vinay and Sanny, Dween and Singh, Abhishek and Gupta, Deepak},
    title     = {CoD: Coherent Detection of Entities From Images With Multiple Modalities},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {8015-8024}
}