Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast

Fan Yang, Yuanzhi Zhao, Haimei Zhao, Yudong Zhao, Haikun Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 30151-30161

Abstract


In unsupervised cross-modal hashing, real-world multimodal data often exhibit partial alignment and semantic ambiguity. Dominant modalities can easily bias the fusion process, while semantically related samples may be mistakenly treated as negatives in contrastive learning, leading to unstable optimization. To address these issues, we propose Unsupervised Weighted Masked Contrastive Hashing (UWMCH). UWMCH introduces masking before multimodal fusion to construct partially observed interactions, encouraging the model to learn complementary semantics and reducing over-reliance on dominant-modality cues. We further develop a semantic-affinity-guided weighted contrastive objective that reduces the influence of false negatives by combining instance-level consistency with a cluster-consensus prior. In addition, the global and local semantic geometries of the fused space are stabilized via Cluster-Centroid Agreement (CCA) and Semantic Structure Regularization (SSR). Extensive experiments on three benchmark datasets demonstrate the effectiveness and robustness of the proposed method.
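
The abstract does not give the exact formulation, so the following is only an illustrative sketch of the two ideas it names: masking features before fusion, and down-weighting likely false negatives in a contrastive loss. Every name and parameter here (`mask_features`, `negative_weights`, `weighted_contrastive_loss`, `alpha`, `tau`, the choice of cosine similarity, and the rule weight = 1 - affinity) is an assumption made for illustration and is not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def mask_features(vec, ratio, rng):
    """Zero out a random fraction `ratio` of the entries (assumed masking scheme)."""
    k = int(round(ratio * len(vec)))
    drop = set(rng.sample(range(len(vec)), k))
    return [0.0 if i in drop else x for i, x in enumerate(vec)]


def negative_weights(feats, clusters, alpha=0.5):
    """Hypothetical weights w[i][j] in [0, 1] for using sample j as a negative of i.

    affinity = alpha * instance similarity (clipped to [0, 1])
             + (1 - alpha) * (1 if i and j share a cluster, else 0)
    weight   = 1 - affinity, so likely false negatives contribute less.
    """
    n = len(feats)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            inst = min(1.0, max(0.0, cosine(feats[i], feats[j])))
            same = 1.0 if clusters[i] == clusters[j] else 0.0
            w[i][j] = 1.0 - (alpha * inst + (1.0 - alpha) * same)
    return w


def weighted_contrastive_loss(img, txt, weights, tau=0.2):
    """Image-to-text InfoNCE-style loss with per-pair negative weights."""
    n = len(img)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(img[i], txt[i]) / tau)
        neg = sum(
            weights[i][j] * math.exp(cosine(img[i], txt[j]) / tau)
            for j in range(n)
            if j != i
        )
        total += -math.log(pos / (pos + neg))
    return total / n


# Toy usage: samples 0 and 1 are near-duplicates assigned to the same cluster.
rng = random.Random(0)
img = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
txt = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]
clusters = [0, 0, 1]
weights = negative_weights(img, clusters)
loss = weighted_contrastive_loss(img, txt, weights)
masked = mask_features([1.0, 2.0, 3.0, 4.0], 0.5, rng)
```

In this toy setup, the pair (0, 1) receives a weight close to zero because the two samples are both similar and co-clustered, so it barely acts as a negative, while the pair (0, 2) keeps a weight of 1. With all weights set to 1 the loss reduces to a standard InfoNCE-style objective.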

Related Material


[pdf]
[bibtex]
@InProceedings{Yang_2026_CVPR,
    author    = {Yang, Fan and Zhao, Yuanzhi and Zhao, Haimei and Zhao, Yudong and Xu, Haikun},
    title     = {Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {30151-30161}
}