Multi-Modal Large Language Models are Effective Vision Learners
Li Sun, Chaitanya Ahuja, Peng Chen, Matt D'Zmura, Kayhan Batmanghelich, Philip Bontrager; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 8606-8615
Abstract
Large language models (LLMs) pre-trained on vast amounts of text have shown remarkable abilities in understanding general knowledge and commonsense. It is therefore desirable to leverage pre-trained LLMs to help solve computer vision tasks. Previous work on multi-modal LLMs has mainly focused on generation capability. In this work, we propose LLM-augmented visual representation learning (LMVR). Our approach first uses a vision encoder to extract features, which are then projected into the word embedding space of the LLM. The LLM then generates responses based on the visual representation and a text prompt. Finally, we aggregate sequence-level features from the hidden layers of the LLM to obtain image-level representations. We conduct extensive experiments on multiple datasets and make the following findings: (a) LMVR outperforms a traditional vision encoder on various downstream tasks and effectively learns the correspondence between words and image regions; (b) LMVR improves generalizability compared to using a vision encoder alone, as evidenced by its superior resistance to domain shift; (c) LMVR improves the robustness of models to corrupted and perturbed visual data. Our findings demonstrate that LLM-augmented visual representation learning is effective because it learns object-level concepts and commonsense knowledge.
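As a rough sketch of the pipeline described in the abstract, the snippet below extracts features with a vision encoder, projects them into an LLM's word embedding space, prepends them to a text prompt, and mean-pools the LLM's last hidden layer into an image-level representation. The specific models (CLIP ViT-B/32 as the encoder, GPT-2 as the LLM), the single linear projection, and the mean-pooling aggregation are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in components (assumptions): CLIP ViT-B/32 as the vision encoder, GPT-2 as the LLM.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
llm = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Linear projection from the vision encoder's feature space to the LLM's word-embedding space.
proj = nn.Linear(vision_encoder.config.hidden_size, llm.config.hidden_size).to(device)


@torch.no_grad()
def image_level_representation(images, prompt: str) -> torch.Tensor:
    """Return one feature vector per image by pooling the LLM's hidden states."""
    # 1. The vision encoder extracts patch-level visual features.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)
    visual_feats = vision_encoder(pixel_values=pixel_values).last_hidden_state   # (B, P, Dv)

    # 2. Project visual features into the LLM's word-embedding space ("visual tokens").
    visual_tokens = proj(visual_feats)                                           # (B, P, Dl)

    # 3. Embed the text prompt and prepend the visual tokens to it.
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    text_embeds = llm.get_input_embeddings()(text_ids)                           # (1, T, Dl)
    text_embeds = text_embeds.expand(visual_tokens.size(0), -1, -1)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)               # (B, P+T, Dl)

    # 4. Run the LLM and aggregate sequence-level features from its hidden layers
    #    (here: mean pooling of the last layer) into an image-level representation.
    outputs = llm(inputs_embeds=inputs_embeds, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)                                 # (B, Dl)
```

For example, `image_level_representation([pil_image], "Describe the image.")` yields one vector per image that could be fed to a downstream linear probe; how the hidden layers are aggregated in the paper itself may differ from this simple mean pooling.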
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Sun_2025_WACV,
    author    = {Sun, Li and Ahuja, Chaitanya and Chen, Peng and D'Zmura, Matt and Batmanghelich, Kayhan and Bontrager, Philip},
    title     = {Multi-Modal Large Language Models are Effective Vision Learners},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8606-8615}
}