ViT-Lens: Towards Omni-modal Representations

Lei, Weixian; Ge, Yixiao; Yi, Kun; Zhang, Jianfeng; Gao, Difei; Sun, Dylan; Ge, Yuying; Shan, Ying; Shou, Mike Zheng

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26647-26657

Abstract

Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens to learn representations for 3D point cloud, depth, audio, tactile, and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
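The pipeline described in the abstract (a tuned modality-specific lens, a frozen pretrained encoder, and alignment to a pre-defined embedding space) can be illustrated with a toy sketch. This is not the authors' implementation: the class `LensPipeline`, the helper names, the random linear maps standing in for the lens and the ViT, and the cosine-based alignment loss are all assumptions made for illustration only.

```python
import math
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in vec)) or 1.0
    return [a / norm for a in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LensPipeline:
    """Hypothetical stand-in: lens -> frozen encoder -> unit-norm embedding."""

    def __init__(self, in_dim, mid_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Modality-specific lens: the only part that would be tuned.
        self.lens = random_matrix(mid_dim, in_dim, rng)
        # Stand-in for the pretrained ViT: assumed frozen, never updated.
        self.frozen_encoder = random_matrix(out_dim, mid_dim, rng)

    def encode(self, signal):
        intermediate = matvec(self.lens, signal)
        hidden = [math.tanh(h) for h in matvec(self.frozen_encoder, intermediate)]
        return normalize(hidden)


def alignment_loss(embedding, anchor):
    """Assumed objective: 1 - cosine similarity to the anchor embedding."""
    return 1.0 - cosine(embedding, anchor)


def zero_shot_classify(embedding, label_anchors):
    """Pick the label whose anchor embedding is most similar."""
    return max(label_anchors, key=lambda label: cosine(embedding, label_anchors[label]))
```

In this sketch, training would adjust only `lens` to reduce `alignment_loss` against anchor embeddings from an off-the-shelf foundation model, and `zero_shot_classify` would then compare an encoded signal against per-label anchor embeddings in that shared space.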

Related Material

[bibtex]

@InProceedings{Lei_2024_CVPR,
    author    = {Lei, Weixian and Ge, Yixiao and Yi, Kun and Zhang, Jianfeng and Gao, Difei and Sun, Dylan and Ge, Yuying and Shan, Ying and Shou, Mike Zheng},
    title     = {ViT-Lens: Towards Omni-modal Representations},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26647-26657}
}