Duplicate Discovery on 2 Billion Internet Images

Wang, Xin-Jing; Zhang, Lei; Liu, Ce

Xin-Jing Wang, Lei Zhang, Ce Liu; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2013, pp. 429-436

Abstract

Duplicate image discovery, or discovering duplicate image clusters, is a challenging problem for billions of Internet images due to the lack of a good distance metric that both covers the large variation within a duplicate image cluster and eliminates false alarms. After carefully investigating existing local and global features that have been widely used for large-scale image search and indexing, we propose a two-step approach that combines both local and global features: global descriptors are used to discover seed clusters with high precision, whereas local descriptors are used to grow the seeds to achieve good recall. Using efficient hashing techniques for both features and the MapReduce framework, our system is able to discover about 553.8 million duplicate images from 2 billion Internet images within 13 hours on a 2,000-core cluster.
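
The two-step idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the descriptors, the hash functions, or the matching rule, so everything here is an assumption made for illustration. Each image is assumed to arrive with a precomputed global hash (an integer) and a set of local hashes; the function name `discover_duplicates` and the threshold `min_shared` are hypothetical. Seeds are formed from exact global-hash collisions, and unseeded images are then attached to the seed cluster with which they share the most local hashes.

```python
from collections import defaultdict


def discover_duplicates(images, min_shared=2):
    """Toy two-step duplicate discovery (illustrative assumption, not the paper's method).

    images: dict mapping image_id -> (global_hash, set_of_local_hashes).
    Returns a list of clusters, each a list of image ids.
    """
    # Step 1: seed clusters are groups of images whose global hashes collide exactly.
    buckets = defaultdict(list)
    for image_id, (global_hash, _) in images.items():
        buckets[global_hash].append(image_id)
    clusters = [sorted(ids) for ids in buckets.values() if len(ids) > 1]

    # Inverted index: local hash -> indices of seed clusters containing it.
    index = defaultdict(set)
    for cluster_no, members in enumerate(clusters):
        for image_id in members:
            for local_hash in images[image_id][1]:
                index[local_hash].add(cluster_no)

    # Step 2: grow seeds. An unseeded image joins the seed cluster with which
    # it shares the most distinct local hashes, if that count reaches min_shared.
    seeded = {image_id for members in clusters for image_id in members}
    for image_id in sorted(images):
        if image_id in seeded:
            continue
        votes = defaultdict(int)
        for local_hash in images[image_id][1]:
            for cluster_no in index.get(local_hash, ()):
                votes[cluster_no] += 1
        if votes:
            # Break ties by preferring the lower cluster index.
            best = max(votes, key=lambda c: (votes[c], -c))
            if votes[best] >= min_shared:
                clusters[best].append(image_id)
    return clusters


if __name__ == "__main__":
    toy = {
        "a": (1, {10, 11, 12}),
        "b": (1, {10, 11, 13}),   # same global hash as "a": forms a seed
        "c": (2, {10, 11, 99}),   # different global hash, shares 2 local hashes
        "d": (3, {50, 51}),       # unrelated image
    }
    print(discover_duplicates(toy))  # [['a', 'b', 'c']]
```

In this toy setup the strict first step favors precision, while the looser second step recovers a variant ("c") that the global hash alone would miss. At the scale the abstract reports, both steps would be distributed, for example by keying records on hash values in MapReduce jobs, which this sketch does not attempt.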

Related Material

[bibtex]

@InProceedings{Wang_2013_CVPR_Workshops,
author = {Wang, Xin-Jing and Zhang, Lei and Liu, Ce},
title = {Duplicate Discovery on 2 Billion Internet Images},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2013},
pages = {429--436}
}