@InProceedings{Yuan_2024_CVPR,
  author    = {Yuan, Yuqian and Li, Wentong and Liu, Jian and Tang, Dongqi and Luo, Xinjie and Qin, Chi and Zhang, Lei and Zhu, Jianke},
  title     = {Osprey: Pixel Understanding with Visual Instruction Tuning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {28202-28211}
}
Osprey: Pixel Understanding with Visual Instruction Tuning
Abstract
Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short of fine-grained vision-language alignment at the pixel level. Besides, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, aiming at pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representations into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority on various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be seamlessly integrated with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
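To illustrate the idea of a mask-aware visual extractor described in the abstract, the sketch below pools a convolutional CLIP feature map inside each binary region mask to produce one region token per mask, which is then projected into the LLM embedding space. This is a minimal assumption-laden sketch, not the authors' released implementation; the class name, pooling strategy and dimensions are hypothetical (see the official repository for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskPoolingExtractor(nn.Module):
    # Hypothetical mask-aware visual extractor: average-pools CLIP features
    # inside each binary mask and projects the result to the LLM dimension.
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # vision features -> LLM embedding space

    def forward(self, feat_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feat_map: (C, H, W) feature map from a convolutional CLIP backbone
        # masks:    (N, H0, W0) binary masks for N regions at input resolution
        C, H, W = feat_map.shape
        masks = F.interpolate(masks[None].float(), size=(H, W), mode="nearest")[0]  # (N, H, W)
        area = masks.flatten(1).sum(dim=1).clamp(min=1.0)                           # (N,)
        pooled = torch.einsum("chw,nhw->nc", feat_map, masks) / area[:, None]       # (N, C)
        return self.proj(pooled)  # (N, llm_dim) region tokens for the LLM

# Usage sketch: one region token per mask, interleaved with the text instruction.
extractor = MaskPoolingExtractor(vision_dim=1024, llm_dim=4096)
feat = torch.randn(1024, 24, 24)       # e.g. a ConvNeXt-CLIP feature map
masks = torch.zeros(2, 384, 384)       # two example region masks
masks[0, 50:150, 60:200] = 1
masks[1, 200:300, 100:180] = 1
region_tokens = extractor(feat, masks)  # shape: (2, 4096)

In practice, such region tokens could be supplied wherever a mask reference appears in the instruction, which is how mask-level prompts can be combined with segmenters such as SAM to query semantics at multiple granularities.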