FELGA: Unsupervised Fragment Embedding for Fine-Grained Cross-Modal Association

Yaoxin Zhuo, Baoxin Li; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 5635-5645

Abstract


Vision-and-Language Pre-trained (VLP) models have demonstrated powerful zero-shot ability on a range of downstream tasks. Most of these models are designed to learn joint embeddings of images and their paired sentences, with both modalities considered globally. This is suboptimal for applications where local-level cross-modal association matters most, for example when a user wants to retrieve images using query words that correspond to only small parts of those images. While a VLP model could in principle be retrained to learn a new embedding that captures such fine-grained association, the expensive annotation required makes this impractical for big-data applications. This paper proposes a novel method named Fragment Embedding by Local and Global Alignment (FELGA), which learns fragment-level embeddings that capture fine-grained cross-modal association by utilizing visual entity proposals and semantic concept proposals in an unsupervised manner. Comprehensive experiments conducted on three VLP models and two datasets demonstrate that FELGA is not limited to specific VLP models and outperforms the original VLP features. In particular, the learned embeddings support cross-modal fragment association tasks, including query-driven object discovery and description assignment.
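
To make the fragment-level association idea concrete, below is a minimal, hypothetical sketch (not the FELGA implementation) that pairs image-region crops, standing in for visual entity proposals, with caption phrases, standing in for semantic concept proposals, using an off-the-shelf CLIP encoder from Hugging Face transformers. The image file name, bounding boxes, and phrases are placeholder assumptions, and proposal generation is assumed to come from external tools.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path

# Hypothetical visual entity proposals: (x0, y0, x1, y1) boxes, e.g. from an
# off-the-shelf region-proposal tool (not shown here).
boxes = [(10, 20, 120, 200), (150, 40, 300, 260)]
crops = [image.crop(b) for b in boxes]

# Hypothetical semantic concept proposals, e.g. noun phrases parsed from a caption.
phrases = ["a brown dog", "a red frisbee"]

inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalize the region and phrase embeddings.
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

# Fragment-level association: cosine similarity between every region and phrase.
similarity = img_emb @ txt_emb.T  # shape: (num_regions, num_phrases)
print(similarity)

FELGA goes further than this sketch: rather than scoring raw VLP features, it learns new fragment embeddings on top of them through local and global alignment, which is not reproduced here.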

Related Material


[pdf]
[bibtex]
@InProceedings{Zhuo_2024_WACV,
    author    = {Zhuo, Yaoxin and Li, Baoxin},
    title     = {FELGA: Unsupervised Fragment Embedding for Fine-Grained Cross-Modal Association},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {5635-5645}
}