Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26439-26455

Abstract


We present Unified-IO 2, a multimodal and multi-skill unified model capable of following novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can generate text, image, or audio outputs, which is accomplished in a unified way by tokenizing these different inputs and outputs into a shared semantic space that can then be processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets, including datasets for segmentation, object detection, image editing, audio localization, video tracking, embodied AI, and 3D detection. To facilitate instruction-following, we add prompts and other data augmentations to these tasks to allow Unified-IO 2 to generalize these skills to new tasks zero-shot. Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and to unify three separate generation capabilities. Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and achieves strong results on 30 diverse datasets, including SEED-Bench image and video understanding, TIFA image generation, VQA 2.0, ScienceQA, VIMA robotic manipulation, VGG-Sound, and Kinetics-Sounds, and can perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work.

Related Material


@InProceedings{Lu_2024_CVPR,
    author    = {Lu, Jiasen and Clark, Christopher and Lee, Sangho and Zhang, Zichen and Khosla, Savya and Marten, Ryan and Hoiem, Derek and Kembhavi, Aniruddha},
    title     = {Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26439-26455}
}