Self-Supervised Learning of Contextualized Local Visual Embeddings

Thalles Silva, Helio Pedrini, Adín Ramírez; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 177-186

Abstract


We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation. Code: https://github.com/sthalles/CLoVE
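
To make the idea of combining local features by similarity concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one possible reading of "normalized": the per-head local features are L2-normalized so that attention scores are cosine similarities. The function name `contextualize`, the `temperature` parameter, and the absence of learned projection matrices are all illustrative assumptions; the actual layer is defined in the paper and the linked repository.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextualize(feature_map, num_heads=4, temperature=0.1):
    """Hypothetical sketch: mix local features of a CNN feature map by similarity.

    feature_map: array of shape (C, H, W), e.g. the output of a CNN encoder.
    Returns (embeddings of shape (H*W, C), attention weights of shape (heads, H*W, H*W)).
    """
    c, h, w = feature_map.shape
    assert c % num_heads == 0, "channels must split evenly across heads"
    d = c // num_heads
    n = h * w

    # Each of the H*W spatial positions is treated as one local feature.
    local = feature_map.reshape(c, n).T                      # (N, C)
    heads = local.reshape(n, num_heads, d).transpose(1, 0, 2)  # (heads, N, d)

    # Assumption: cosine similarity between positions, computed per head.
    unit = l2_normalize(heads)
    scores = unit @ unit.transpose(0, 2, 1) / temperature    # (heads, N, N)
    weights = softmax(scores, axis=-1)

    # Each output embedding is a similarity-weighted sum of all local features.
    mixed = weights @ heads                                   # (heads, N, d)
    return mixed.transpose(1, 0, 2).reshape(n, c), weights

# Usage on a random 64-channel, 7x7 feature map (stand-in for encoder output).
rng = np.random.default_rng(0)
embeddings, attn = contextualize(rng.standard_normal((64, 7, 7)))
```

In this sketch each spatial position yields one contextualized embedding, so a loss could be applied per position rather than on a single pooled image vector.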

Related Material

@InProceedings{Silva_2023_ICCV,
    author    = {Silva, Thalles and Pedrini, Helio and Ram{\'\i}rez, Ad{\'\i}n},
    title     = {Self-Supervised Learning of Contextualized Local Visual Embeddings},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {177-186}
}