UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, Lin Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26752-26762

Abstract


We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, the space may be sub-optimal as it leads to an unbalanced representation space among all modalities. Moreover, the category names are directly used to extract text embeddings for the downstream tasks, making it hardly possible to represent the semantics of multi-modal data. The 'out-of-the-box' insight of our UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
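The three steps in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`class_centers`, `center_contrastive_loss`, `zero_shot_predict`), the plain per-class mean used in place of the paper's adaptive center construction, and the temperature of 0.07 are all assumptions made for illustration. The text embeddings stand in for a knowledge base of LLM-generated descriptions, with at least one description per class.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def class_centers(text_emb, labels, num_classes):
    """Build one center per class from description embeddings.

    Assumption: a simple mean over each class's normalized text embeddings,
    standing in for the adaptive construction described in the abstract.
    Every class must have at least one description.
    """
    text_emb = l2_normalize(text_emb)
    centers = np.zeros((num_classes, text_emb.shape[1]))
    for c in range(num_classes):
        centers[c] = text_emb[labels == c].mean(axis=0)
    return l2_normalize(centers)

def center_contrastive_loss(emb, targets, centers, temperature=0.07):
    """Cross-entropy over cosine similarities between embeddings and centers.

    Pulls each embedding (from any modality) toward its own class center
    and pushes it away from the other centers.
    """
    logits = l2_normalize(emb) @ centers.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def zero_shot_predict(emb, centers):
    """Assign each embedding to the class of its most similar center."""
    return (l2_normalize(emb) @ centers.T).argmax(axis=1)
```

Because the centers are built from text descriptions rather than from any one sensor modality, the same `centers` array can serve as the alignment target for image, audio, or point-cloud encoders, which is the sense in which the center is modality-agnostic.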

Related Material


[bibtex]
@InProceedings{Lyu_2024_CVPR,
    author    = {Lyu, Yuanhuiyi and Zheng, Xu and Zhou, Jiazhou and Wang, Lin},
    title     = {UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26752-26762}
}