Multimodal Knowledge Expansion

Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 854-863

Abstract

The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and unlabeled multimodal data poses an interesting problem: how can a pre-trained unimodal network be transferred to perform the same task on unlabeled multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework that effectively utilizes multimodal data without requiring labels. In contrast to traditional knowledge distillation, where the student is designed to be lightweight and inferior to the teacher, we observe that the multimodal student model consistently rectifies pseudo labels and generalizes better than its teacher. Extensive experiments on four tasks and different modalities verify this finding. Furthermore, we connect the mechanism of MKE to semi-supervised learning and offer both empirical and theoretical explanations of the expansion capability of a multimodal student.
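To make the setup concrete, the sketch below illustrates one distillation step of the kind the abstract describes: a frozen unimodal teacher produces soft pseudo labels from the modality it was trained on, and a multimodal student that sees both modalities is trained on those pseudo labels with no ground-truth annotations. This is a minimal PyTorch sketch under assumed conditions (a classification task, toy linear encoders, a KL distillation objective); all class names, dimensions, and architectures here are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnimodalTeacher(nn.Module):
    """Toy stand-in for a network pre-trained on labeled unimodal data."""
    def __init__(self, dim_a=32, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(dim_a, num_classes)

    def forward(self, x_a):
        return self.fc(x_a)

class MultimodalStudent(nn.Module):
    """Toy two-branch student that fuses both modalities before the head."""
    def __init__(self, dim_a=32, dim_b=64, num_classes=10):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, 16)
        self.enc_b = nn.Linear(dim_b, 16)
        self.head = nn.Linear(32, num_classes)

    def forward(self, x_a, x_b):
        h = torch.cat([F.relu(self.enc_a(x_a)), F.relu(self.enc_b(x_b))], dim=-1)
        return self.head(h)

teacher = UnimodalTeacher().eval()  # frozen; assumed already trained with labels
student = MultimodalStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# One step on an unlabeled multimodal batch (x_a, x_b): no ground truth is used.
x_a, x_b = torch.randn(8, 32), torch.randn(8, 64)
with torch.no_grad():
    pseudo = teacher(x_a).softmax(dim=-1)  # teacher's soft pseudo labels
logits = student(x_a, x_b)                 # student sees both modalities
loss = F.kl_div(logits.log_softmax(dim=-1), pseudo, reduction="batchmean")
opt.zero_grad()
loss.backward()
opt.step()
```

The point of the sketch is the data flow only: the teacher consumes a single modality, the student consumes both, and no labels enter the loss, which is the setting in which the paper observes the student rectifying the teacher's pseudo labels.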

Related Material

[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Xue_2021_ICCV,
    author    = {Xue, Zihui and Ren, Sucheng and Gao, Zhengqi and Zhao, Hang},
    title     = {Multimodal Knowledge Expansion},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {854-863}
}