Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems

Haoquan Zhang, Ronggang Huang, Yi Xie, Huaidong Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13373-13383

Abstract


In Visual Question Answering (VQA), recognizing and localizing entities pose significant challenges. Pre-trained vision-and-language models have addressed this problem by providing a text description as the answer. However, in visual scenes with multiple entities, textual descriptions struggle to distinguish entities of the same category effectively. Consequently, VQA datasets are constrained by the limits of textual description and cannot adequately cover scenarios involving multiple entities. To address this challenge, we introduce the Mask for Align (Mask4Align) method, which determines the position of the entity in a given image that best matches the user-input question. This method incorporates colored masks into the image, enabling the VQA model to handle the discrimination and localization challenges associated with multiple entities. To process an arbitrary number of similar entities, Mask4Align is designed hierarchically to discern subtle differences, achieving precise localization. Since Mask4Align directly uses pre-trained models, it introduces no additional training overhead. Extensive experiments on both a gaze target prediction dataset and our proposed multi-entity localization dataset showcase the superiority of Mask4Align.
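
The colored-mask prompting and hierarchical selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the palette, the blending factor `alpha`, the function names, and the `ask_vqa` callable (a stand-in for any pre-trained VQA model that returns one of the offered color names) are all assumptions made for illustration. Entity masks are assumed to be given as boolean arrays.

```python
import numpy as np

# Hypothetical palette; the abstract does not state which colors are used.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def overlay_color_masks(image, masks, alpha=0.5):
    """Blend one palette color over each boolean mask.

    Returns the prompted image and a dict mapping color name -> mask index.
    """
    if len(masks) > len(PALETTE):
        raise ValueError("more masks than palette colors")
    out = image.astype(np.float32).copy()
    color_to_entity = {}
    for idx, (mask, (name, rgb)) in enumerate(zip(masks, PALETTE.items())):
        color = np.array(rgb, dtype=np.float32)
        # Alpha-blend the color into the pixels covered by this mask.
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color
        color_to_entity[name] = idx
    return out.round().astype(np.uint8), color_to_entity


def localize(image, masks, question, ask_vqa, alpha=0.5):
    """Pick the entity that best matches the question, round by round.

    Entities are split into groups no larger than the palette. The
    (assumed) ask_vqa(prompted_image, question, color_names) callable
    returns one color name per group, and the winners advance until a
    single entity remains.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    candidates = list(range(len(masks)))
    group_size = len(PALETTE)
    while len(candidates) > 1:
        winners = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            if len(group) == 1:
                # A lone entity advances without a query.
                winners.append(group[0])
                continue
            prompted, mapping = overlay_color_masks(
                image, [masks[i] for i in group], alpha
            )
            color = ask_vqa(prompted, question, list(mapping))
            winners.append(group[mapping[color]])
        candidates = winners
    return candidates[0]
```

Under these assumptions no training is involved: the only model call is `ask_vqa`, and the number of queries grows with the number of entities rather than with any fine-tuning budget.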

Related Material


[bibtex]
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Haoquan and Huang, Ronggang and Xie, Yi and Zhang, Huaidong},
    title     = {Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13373-13383}
}