@InProceedings{Wu_2024_CVPR,
    author    = {Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Xu, Kaixin and Li, Chunyi and Hou, Jingwen and Zhai, Guangtao and Xue, Geng and Sun, Wenxiu and Yan, Qiong and Lin, Weisi},
    title     = {Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {25490-25500}
}
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Abstract
Multi-modality large language models (MLLMs), as represented by GPT-4V, have introduced a paradigm shift for visual perception and understanding tasks, in which a variety of abilities can be achieved within one foundation model. While current MLLMs demonstrate primary low-level visual abilities, from the identification of low-level visual attributes (e.g., clarity, brightness) to the evaluation of image quality, there is still an imperative to further improve the accuracy of MLLMs to substantially alleviate human burdens. To address this, we collect the first dataset consisting of human natural language feedback on low-level vision. Each piece of feedback offers a comprehensive description of an image's low-level visual attributes, culminating in an overall quality assessment. The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18,973 multi-sourced images with diverse low-level appearance. To ensure MLLMs can adeptly handle diverse queries, we further propose a GPT-participated transformation to convert these feedbacks into a rich set of 200K instruction-response pairs, termed Q-Instruct. Experimental results indicate that Q-Instruct consistently elevates various low-level visual capabilities across multiple base models. We anticipate that our datasets can pave the way for a future in which foundation models can assist humans on low-level visual tasks.