Recognize Anything: A Strong Image Tagging Model

Zhang, Youcai; Huang, Xinyu; Ma, Jinyu; Li, Zhaoyang; Luo, Zhaochuan; Xie, Yanchun; Qin, Yuzhuo; Luo, Tong; Li, Yaqian; Liu, Shilong; Guo, Yandong; Zhang, Lei

Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1724-1732

Abstract

We present the Recognize Anything Model (RAM), a strong foundation model for image tagging. RAM makes a substantial step for foundation models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. By leveraging large-scale image-text pairs for training instead of manual annotations, RAM introduces a new paradigm for image tagging. The development of RAM comprises four key steps. First, annotation-free image tags are obtained at scale through automatic text semantic parsing. Second, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and the parsed tags, respectively. Third, a data engine is employed to generate additional annotations and clean incorrect ones. Finally, the model is retrained with the processed data and fine-tuned on a smaller but higher-quality dataset. We evaluate the tagging capability of RAM on numerous benchmarks and observe impressive zero-shot performance, which significantly outperforms CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and is competitive with the Google tagging API. We have released RAM at https://recognize-anything.github.io/ to foster the advancement of foundation models in computer vision.
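The first of the four steps, deriving image tags from the paired text without manual annotation, can be illustrated with a minimal sketch. This is not RAM's parser: the abstract states only that tags come from automatic text semantic parsing, so a plain vocabulary lookup stands in for it here. The names `TAG_VOCAB`, `SYNONYMS`, `parse_tags`, and `tag_frequencies`, as well as the example captions, are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical tag vocabulary (an assumption for this sketch; not RAM's label set).
TAG_VOCAB = {"dog", "cat", "frisbee", "park", "grass", "sofa"}

# Hypothetical map that folds plurals and synonyms onto canonical tags.
SYNONYMS = {"puppy": "dog", "dogs": "dog", "kitten": "cat", "cats": "cat", "couch": "sofa"}


def parse_tags(caption: str) -> list:
    """Return the sorted, de-duplicated vocabulary tags mentioned in a caption."""
    words = re.findall(r"[a-z]+", caption.lower())
    tags = set()
    for word in words:
        canonical = SYNONYMS.get(word, word)
        if canonical in TAG_VOCAB:
            tags.add(canonical)
    return sorted(tags)


def tag_frequencies(captions) -> Counter:
    """Count how many captions mention each tag across a corpus."""
    counts = Counter()
    for caption in captions:
        counts.update(parse_tags(caption))
    return counts


captions = [
    "A puppy catches a frisbee in the park.",
    "Two cats sleeping on a couch.",
]
for caption in captions:
    print(caption, "->", parse_tags(caption))
```

Each image-text pair then carries two supervision signals, the original caption and the parsed tag list, which matches the abstract's description of training the captioning and tagging tasks jointly. A real pipeline would need a far larger vocabulary and a proper linguistic parser rather than word matching.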

Related Material

[bibtex]

@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and Guo, Yandong and Zhang, Lei},
    title     = {Recognize Anything: A Strong Image Tagging Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1724-1732}
}