Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Yan, Shannan; Zheng, Leqi; Lv, Keyu; Ni, Jingchen; Wei, Hongyang; Zhang, Jiajun; Wang, Guangting; LYU, Jing; Yuan, Chun; Rao, Fengyun

Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing LYU, Chun Yuan, Fengyun Rao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 6653-6663

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
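
The cycle-consistency objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Predictor` signature, the string view identifiers, the toy `flip_predictor`, and the choice of binary cross-entropy as the reconstruction loss are all assumptions made for illustration. The sketch shows only the structure of the objective: predict the mask in the target view, project it back to the source view, and compare the reconstruction with the original query mask.

```python
import math
from typing import Callable, List

Mask = List[List[float]]  # per-pixel foreground probabilities in [0, 1]
# Hypothetical interface: (conditioning mask, view it comes from, view to predict in) -> mask
Predictor = Callable[[Mask, str, str], Mask]


def binary_cross_entropy(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean per-pixel binary cross-entropy between two same-sized masks."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def cycle_consistency_loss(predict: Predictor, query_mask: Mask,
                           source: str, target: str) -> float:
    """Source -> target -> source reconstruction loss (assumed to be BCE here)."""
    # Forward pass: localize the object in the target view, conditioned on the query mask.
    target_mask = predict(query_mask, source, target)
    # Backward pass: project the predicted mask back to the source view.
    reconstructed = predict(target_mask, target, source)
    # Only the query mask is needed as supervision; no target-view label is used.
    return binary_cross_entropy(reconstructed, query_mask)


def flip_predictor(mask: Mask, src: str, dst: str) -> Mask:
    """Toy stand-in for a model: a horizontal flip plays the role of the view change."""
    return [list(reversed(row)) for row in mask]


def zero_predictor(mask: Mask, src: str, dst: str) -> Mask:
    """Toy stand-in for a model that never finds the object."""
    return [[0.0 for _ in row] for row in mask]


query = [[0.0, 1.0], [0.0, 1.0]]
consistent = cycle_consistency_loss(flip_predictor, query, "ego", "exo")
inconsistent = cycle_consistency_loss(zero_predictor, query, "ego", "exo")
print(consistent < inconsistent)
```

Because the loss depends only on the query mask, the same quantity could in principle be minimized on unlabeled test pairs, which is the property the abstract relies on for test-time training.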

Related Material

@InProceedings{Yan_2026_CVPR,
    author    = {Yan, Shannan and Zheng, Leqi and Lv, Keyu and Ni, Jingchen and Wei, Hongyang and Zhang, Jiajun and Wang, Guangting and LYU, Jing and Yuan, Chun and Rao, Fengyun},
    title     = {Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {6653-6663}
}