ModaVerse: Efficiently Transforming Modalities with LLMs

Xinyu Wang, Bohan Zhuang, Qi Wu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26606-26616

Abstract


Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities, including images, videos, and audio. Predominant MLLM frameworks have largely relied on aligning the latent spaces of textual and non-textual features. This alignment process, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, avoiding the complexities of latent feature alignment and simplifying the multiple training stages of existing MLLMs into a single, efficient process. Through experiments on several benchmarks, we demonstrate that our approach attains performance comparable to the state of the art while achieving considerable efficiency in data usage.
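
The sketch below illustrates the I/O-alignment idea described in the abstract: instead of training projection layers to align latent features, the LLM emits a natural-language meta response that names a target modality and a generation prompt, which a lightweight router hands to a frozen, off-the-shelf generator. This is not the authors' implementation; all names here (call_llm, route, GENERATORS) are hypothetical placeholders for illustration only.

```python
# Minimal sketch of input/output (I/O) alignment at the natural-language level,
# as described in the abstract. Not the authors' code; all names are hypothetical.

import json

# Hypothetical registry of frozen, off-the-shelf generators keyed by modality.
GENERATORS = {
    "image": lambda prompt: f"<image generated from: {prompt!r}>",
    "video": lambda prompt: f"<video generated from: {prompt!r}>",
    "audio": lambda prompt: f"<audio generated from: {prompt!r}>",
}

def call_llm(user_request: str) -> str:
    """Stand-in for the instruction-tuned LLM. In the paper's setting, the model
    is trained in a single stage to answer in text and, when a non-text output
    is required, to emit an explicit generation instruction in its response."""
    # Hard-coded example output purely for illustration.
    return json.dumps({
        "text": "Sure, here is a picture of a red panda climbing a tree.",
        "modality": "image",
        "prompt": "a red panda climbing a mossy tree, soft morning light",
    })

def route(meta_response: str) -> dict:
    """Parse the LLM's meta response and invoke the matching generator.
    No latent-feature projection layers are involved: alignment happens
    purely between the LLM's text output and the generator's text input."""
    spec = json.loads(meta_response)
    out = {"text": spec["text"]}
    modality = spec.get("modality")
    if modality in GENERATORS:
        out[modality] = GENERATORS[modality](spec["prompt"])
    return out

if __name__ == "__main__":
    print(route(call_llm("Draw me a red panda in a tree.")))
```

Because the downstream generators consume ordinary text prompts, they can remain frozen, which is what allows the multi-stage feature-alignment training of prior MLLMs to collapse into a single training pass over the LLM.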

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
    title     = {ModaVerse: Efficiently Transforming Modalities with LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26606-26616}
}