Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6108-6117

Abstract


We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improving an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models, so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and a task-specific head as usual, but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
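
The Cross-Modal Re-parameterization mentioned above can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration based only on the abstract's description, assuming the frozen auxiliary weights enter each linear layer as an additive term scaled by a learnable factor that is folded back into a single weight matrix after training; the names CrossModalLinear and merge are hypothetical and are not taken from the official M2PT code.

```python
# Minimal sketch of a cross-modal re-parameterized linear layer (illustrative only).
import torch
import torch.nn as nn


class CrossModalLinear(nn.Module):
    def __init__(self, target_linear: nn.Linear, aux_weight: torch.Tensor):
        super().__init__()
        self.linear = target_linear                                                # trainable target-modality weights
        self.aux_weight = nn.Parameter(aux_weight.clone(), requires_grad=False)    # frozen auxiliary-modality weights
        self.scale = nn.Parameter(torch.zeros(1))                                  # learnable cross-modal scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed form: effective weight = W_target + scale * W_aux; bias from the target layer only.
        weight = self.linear.weight + self.scale * self.aux_weight
        return nn.functional.linear(x, weight, self.linear.bias)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # After training, fold the auxiliary branch into one nn.Linear so the
        # deployed layer has the same shape and cost as the original.
        merged = nn.Linear(self.linear.in_features, self.linear.out_features,
                           bias=self.linear.bias is not None)
        merged.weight.copy_(self.linear.weight + self.scale * self.aux_weight)
        if self.linear.bias is not None:
            merged.bias.copy_(self.linear.bias)
        return merged
```

Because the merged layer is an ordinary nn.Linear of the original shape, deployment adds no parameters or FLOPs, consistent with the abstract's statement that the auxiliary weights are exploited without any inference costs.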

Related Material


[bibtex]
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Yiyuan and Ding, Xiaohan and Gong, Kaixiong and Ge, Yixiao and Shan, Ying and Yue, Xiangyu},
    title     = {Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {6108-6117}
}