General Object Foundation Model for Images and Videos at Scale

Wu, Junfeng; Jiang, Yi; Liu, Qihao; Yuan, Zehuan; Bai, Xiang; Bai, Song

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3783-3795

Abstract

We present GLEE in this work an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework GLEEaccomplishes detection segmentation tracking grounding and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations excelling in zero-shot transfer to new data and tasks. Specifically we employ an image encoder text encoder and visual prompter to handle multi-modal inputs enabling to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks GLEE exhibits remarkable versatility and improved generalization performance efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data we further enhance its zero-shot generalization capabilities. Additionally GLEE is capable of being integrated into Large Language Models serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The models and code are released at https://github.com/FoundationVision/GLEE.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Wu_2024_CVPR, author = {Wu, Junfeng and Jiang, Yi and Liu, Qihao and Yuan, Zehuan and Bai, Xiang and Bai, Song}, title = {General Object Foundation Model for Images and Videos at Scale}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {3783-3795} }