@InProceedings{Mur-Labadia_2025_ICCV,
    author    = {Mur-Labadia, Lorenzo and Santos-Villafranca, Maria and Bermudez-Cameo, Jesus and Perez-Yus, Alejandro and Martinez-Cantin, Ruben and Guerrero, Jose J.},
    title     = {O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {6892-6903}
}
O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
Abstract
Understanding the world from multiple perspectives is essential for intelligent systems that operate together, and segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego-Exo Cross-Attention module that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy that encourages the model to better differentiate between nearby objects. O-MaMa achieves state-of-the-art results on the Ego-Exo4D Correspondences benchmark, with relative IoU gains of +22% and +76% in Ego2Exo and Exo2Ego over the official challenge baselines, and improvements of +13% and +6% over the previous SOTA while using only 1% of the training parameters.
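The abstract describes the pipeline only at a high level. Purely as an illustration of the general idea, the following is a minimal PyTorch sketch of mask-pooled descriptors and a contrastive mask-matching objective; it is not the authors' released implementation, and the function names, tensor shapes, plain average pooling, and InfoNCE-style loss are assumptions made for this sketch (the cross-attention fusion and hard negative mining are omitted).

import torch
import torch.nn.functional as F

def pool_mask_features(dense_feats, masks):
    # dense_feats: (C, H, W) per-pixel semantic features (e.g. from a DINOv2-style backbone)
    # masks:       (N, H, W) binary candidate masks (e.g. from a FastSAM-style proposer)
    # returns:     (N, C) one descriptor per candidate mask (average of features inside the mask)
    flat_feats = dense_feats.flatten(1)                    # (C, H*W)
    flat_masks = masks.float().flatten(1)                  # (N, H*W)
    area = flat_masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return (flat_masks @ flat_feats.t()) / area            # (N, C)

def mask_matching_loss(ego_desc, exo_desc, match_idx, temperature=0.07):
    # ego_desc:  (N_ego, C) descriptors of ego-view candidate masks
    # exo_desc:  (N_exo, C) descriptors of exo-view candidate masks
    # match_idx: (N_ego,) index of the matching exo mask for each ego mask;
    #            all other exo candidates (including nearby/adjacent masks) act as negatives
    ego = F.normalize(ego_desc, dim=-1)
    exo = F.normalize(exo_desc, dim=-1)
    logits = ego @ exo.t() / temperature                   # (N_ego, N_exo) scaled cosine similarities
    return F.cross_entropy(logits, match_idx)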
