CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification

Marcos V. Conde, Kerem Turgutlu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 3956-3960

Abstract


Existing computer vision research on artwork struggles with fine-grained attribute recognition and with the lack of curated annotated datasets, which are costly to create. In this work, we use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of art image and text pairs, so that it learns directly from raw descriptions of images or, where available, from curated labels. The model's zero-shot capability allows it to predict the most relevant natural language description for a given image without directly optimizing for the task. Our approach addresses two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset. Our code and models are available at https://github.com/KeremTurgutlu/clip_art.
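As an illustration of the zero-shot prediction described above, the following is a minimal sketch using OpenAI's public CLIP release (the clip Python package), not the fine-tuned models from the repository linked here; the image path and the candidate attribute descriptions are hypothetical placeholders.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Load a pretrained CLIP backbone and its matching image preprocessor.
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Hypothetical fine-grained attribute descriptions, in the spirit of
    # iMet-style labels; the real label set comes from the dataset.
    labels = ["a porcelain vase", "an oil painting", "a bronze sculpture"]
    text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

    # "artwork.jpg" is a placeholder path to a query image.
    image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        # CLIP scores the image against each candidate description;
        # a softmax over the similarities gives zero-shot probabilities.
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    print(dict(zip(labels, probs[0])))

The same image and text encoders can serve instance retrieval: embed a gallery of artworks once, then rank them by cosine similarity to a query embedding.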

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Conde_2021_CVPR,
    author    = {Conde, Marcos V. and Turgutlu, Kerem},
    title     = {CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2021},
    pages     = {3956-3960}
}