Link the Head to the "Beak": Zero Shot Learning From Noisy Text Description at Part Precision

Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, Ahmed Elgammal; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5640-5649

Abstract


In this paper, we study learning visual classifiers from unstructured text descriptions at part precision with no training images. We show that visual text terms can be encouraged to attend to their relevant parts, while image connections to non-visual text terms vanish without any supervision. This learning process enables a term like "beak" to be linked only to relevant parts such as the head, while a non-visual term like "migrate" does not affect the classifier prediction, all without part-text annotation. Images are encoded by a part-based CNN that detects bird parts and learns part-specific representations. Part-based visual classifiers are predicted from the text descriptions of unseen classes to facilitate classification without training images (also known as zero-shot recognition). We performed our experiments on the CUB200 dataset and improved the zero-shot recognition results from 34.2% to 44.0%. We also created a large-scale benchmark of 404 classes of North American birds with text descriptions, on which we likewise show that our method outperforms existing methods.
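
To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of predicting part-based classifiers from noisy text: a part-based encoder yields one feature vector per detected bird part, and per-part linear maps turn a class's text vector (e.g. TF-IDF) into part-specific classifier weights whose scores are summed into a class compatibility score. All module names, dimensions, and the PyTorch framing are assumptions for illustration; encouraging sparsity in the text-to-part maps is what would let visual terms attend to their relevant parts while non-visual terms vanish.

import torch
import torch.nn as nn

class PartZeroShot(nn.Module):
    """Hypothetical sketch: predict per-part classifiers from text features.

    Assumes `num_parts` part-specific CNN features of size `feat_dim`
    (e.g. cropped around detected parts such as head, breast, wings)
    and one text vector of size `text_dim` per class.
    """

    def __init__(self, text_dim, feat_dim, num_parts):
        super().__init__()
        # One linear map per part: text description -> classifier weights
        # for that part.
        self.text_to_part = nn.ModuleList(
            [nn.Linear(text_dim, feat_dim) for _ in range(num_parts)]
        )

    def forward(self, part_feats, text_vec):
        # part_feats: (batch, num_parts, feat_dim) part-based CNN features
        # text_vec:   (text_dim,) noisy text description of one class
        scores = 0.0
        for p, proj in enumerate(self.text_to_part):
            w_p = proj(text_vec)                      # predicted part classifier
            scores = scores + part_feats[:, p, :] @ w_p
        return scores                                 # (batch,) class compatibility

# Usage sketch: score images against the text description of an unseen class.
model = PartZeroShot(text_dim=7000, feat_dim=512, num_parts=7)
imgs = torch.randn(4, 7, 512)          # 4 images, 7 detected parts each
unseen_text = torch.randn(7000)        # text vector for one unseen class
print(model(imgs, unseen_text).shape)  # torch.Size([4])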

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Elhoseiny_2017_CVPR,
author = {Elhoseiny, Mohamed and Zhu, Yizhe and Zhang, Han and Elgammal, Ahmed},
title = {Link the Head to the "Beak": Zero Shot Learning From Noisy Text Description at Part Precision},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}