Contrastive Attention Maps for Self-Supervised Co-Localization

Minsong Ki, Youngjung Uh, Junsuk Choe, Hyeran Byun; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2803-2812


The goal of unsupervised co-localization is to locate the object in a scene under the assumptions that 1) the dataset consists of only one superclass, e.g., birds, and 2) there are no human-annotated labels in the dataset. The most recent method achieves impressive co-localization performance by employing self-supervised representation learning approaches such as predicting rotation. In this paper, we introduce a new contrastive objective applied directly to the attention maps to enhance co-localization performance. Our contrastive loss function exploits rich location information, which induces the model to activate the full extent of the object effectively. In addition, we propose a pixel-wise attention pooling that selectively aggregates the feature map according to its magnitudes across channels. Our methods are simple and are shown to be effective by extensive qualitative and quantitative evaluation, achieving state-of-the-art co-localization performance by large margins on four datasets: CUB-200-2011, Stanford Cars, FGVC-Aircraft, and Stanford Dogs. Our code will be publicly available online for the research community.
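
As a rough illustration of what "selectively aggregating the feature map by magnitude across channels" could look like, the sketch below collapses a (C, H, W) feature map into an (H, W) attention map using a per-pixel softmax over channels, so that channels with larger activations contribute more at each location. The abstract does not give the formula, so this softmax weighting, the function name `pixelwise_attention_pool`, and the `temperature` parameter are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def pixelwise_attention_pool(features, temperature=1.0):
    """Collapse a (C, H, W) feature map into an (H, W) attention map.

    Hypothetical sketch: at every pixel, channels are weighted by a softmax
    over their activations, so high-magnitude channels dominate the result.
    The paper's actual pooling may differ.
    """
    features = np.asarray(features, dtype=np.float64)
    if features.ndim != 3:
        raise ValueError("expected a (C, H, W) feature map")

    # Numerically stable softmax over the channel axis, per pixel.
    scaled = features / temperature
    scaled = scaled - scaled.max(axis=0, keepdims=True)
    weights = np.exp(scaled)
    weights /= weights.sum(axis=0, keepdims=True)

    # Weighted sum over channels yields one attention value per pixel.
    return (weights * features).sum(axis=0)


# Usage on a random feature map with 8 channels and a 4x4 spatial grid.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))
attn = pixelwise_attention_pool(feats)
print(attn.shape)  # (4, 4)
```

Under this assumed weighting, each output value lies between the plain channel average and the channel maximum at that pixel; lowering `temperature` moves it toward max pooling, and raising it moves it toward average pooling.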

Related Material

@InProceedings{Ki_2021_ICCV,
    author    = {Ki, Minsong and Uh, Youngjung and Choe, Junsuk and Byun, Hyeran},
    title     = {Contrastive Attention Maps for Self-Supervised Co-Localization},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2803-2812}
}