Unbiased Missing-modality Multimodal Learning

Ruiting Dai, Chenxi Li, Yandong Yan, Lisi Mo, Ke Qin, Tao He; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 24507-24517

Abstract


Recovering missing modalities in multimodal learning has recently been approached using diffusion models to synthesize absent data conditioned on available modalities. However, existing methods often suffer from modality generation bias: while certain modalities are generated with high fidelity, others, such as video, remain challenging due to intrinsic modality gaps, leading to imbalanced training. To address this issue, we propose MD^2N (Multi-stage Duplex Diffusion Network), a novel framework for unbiased missing-modality recovery. MD^2N introduces a modality transfer module within a duplex diffusion architecture, enabling bidirectional generation between available and missing modalities through three stages: (1) global structure generation, (2) modality transfer, and (3) local cross-modal refinement. By training with duplex diffusion, the available and missing modalities generate each other in an intersecting manner, effectively achieving a balanced generation state. Extensive experiments demonstrate that MD^2N significantly outperforms existing state-of-the-art methods, achieving up to 4% improvement over IMDer on the CMU-MOSEI dataset. Project page: https://crystal-punk.github.io/.
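The abstract describes a three-stage duplex pipeline but the page carries no code. Below is a minimal, hypothetical PyTorch sketch of how the three stages might be wired together; every class, module name, and tensor shape is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DuplexDiffusionSketch(nn.Module):
    """Hypothetical sketch of the three stages named in the abstract:
    (1) global structure generation, (2) modality transfer,
    (3) local cross-modal refinement. All module names are assumptions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Stage 1: denoiser that drafts the global structure of the
        # missing modality from noise, conditioned on the available one.
        self.global_denoiser = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # Stage 2: transfer modules mapping between the two modality
        # latent spaces in both directions (the "duplex" aspect).
        self.transfer_a2m = nn.Linear(dim, dim)  # available -> missing
        self.transfer_m2a = nn.Linear(dim, dim)  # missing -> available
        # Stage 3: local refinement conditioned on transferred features.
        self.refiner = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, available: torch.Tensor, noise: torch.Tensor):
        # Stage 1: draft the missing modality's global structure.
        draft = self.global_denoiser(torch.cat([noise, available], dim=-1))
        # Stage 2: exchange information in both directions, so each
        # modality contributes to generating the other.
        to_missing = self.transfer_a2m(available)
        to_available = self.transfer_m2a(draft)
        # Stage 3: refine the draft with the transferred features.
        recovered = self.refiner(torch.cat([draft, to_missing], dim=-1))
        return recovered, to_available


if __name__ == "__main__":
    model = DuplexDiffusionSketch(dim=256)
    avail = torch.randn(4, 256)  # batch of available-modality features
    noise = torch.randn(4, 256)  # initial noise for the missing modality
    recovered, reverse = model(avail, noise)
    print(recovered.shape, reverse.shape)
```

The bidirectional `transfer_a2m` / `transfer_m2a` pair is meant only to suggest how the duplex design lets each modality condition the other's generation, which the paper credits for balancing generation quality across modalities.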

Related Material


[bibtex]
@InProceedings{Dai_2025_ICCV,
  author    = {Dai, Ruiting and Li, Chenxi and Yan, Yandong and Mo, Lisi and Qin, Ke and He, Tao},
  title     = {Unbiased Missing-modality Multimodal Learning},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {24507-24517}
}