Zero-Shot Learning Using Multimodal Descriptions

Utkarsh Mall, Bharath Hariharan, Kavita Bala; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 3931-3939


Zero-shot learning (ZSL) tackles the problem of recognizing unseen classes using only semantic descriptions, e.g., attributes. Current zero-shot learning techniques all assume that a single vector of attributes suffices to describe each category. We show that this assumption is incorrect. Many classes in real-world problems have multiple modes of appearance: male and female birds vary in appearance, for instance. Domain experts know this and can provide attribute descriptions of the chief modes of appearance for each class. Motivated by this, we propose the task of multimodal zero-shot learning, where the learner must learn from these multimodal attribute descriptions. We present new benchmarks for this task on CUB, SUN, and DeepFashion, and a multimodal ZSL technique that significantly outperforms its unimodal counterpart. Because it allows annotators to provide more than one description, we posit that multimodal ZSL is more practical for real-world deployment.
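
To make the distinction between unimodal and multimodal class descriptions concrete, the following is a minimal sketch and not the method from the paper. It assumes each class is described by one or more attribute vectors (one per mode of appearance), that an image has already been mapped to a predicted attribute vector, and that compatibility is a plain dot product maximized over modes. The function names, class names, and attribute values are all hypothetical toy choices.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product used here as a stand-in compatibility score (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def predict_multimodal(image_attrs: Vector, class_modes: Dict[str, List[Vector]]) -> str:
    """Pick the class whose best-matching mode description scores highest."""
    best_class, best_score = "", float("-inf")
    for name, modes in class_modes.items():
        score = max(dot(image_attrs, mode) for mode in modes)
        if score > best_score:
            best_class, best_score = name, score
    return best_class


def predict_unimodal(image_attrs: Vector, class_modes: Dict[str, List[Vector]]) -> str:
    """Baseline: collapse each class to a single averaged attribute vector."""
    averaged = {
        name: [[sum(col) / len(modes) for col in zip(*modes)]]
        for name, modes in class_modes.items()
    }
    return predict_multimodal(image_attrs, averaged)


# Hypothetical attributes: [red plumage, brown plumage, streaked breast],
# encoded as +1 (present) or -1 (absent). Values are illustrative only.
class_modes = {
    "species_a": [[1, -1, -1],   # mode 1, e.g. male
                  [-1, 1, -1]],  # mode 2, e.g. female
    "species_b": [[-1, 1, 1]],   # a single mode of appearance
}

# Predicted attributes for an image showing mode 2 of species_a.
image_attrs = [-0.9, 0.8, -0.7]

multimodal_prediction = predict_multimodal(image_attrs, class_modes)
unimodal_prediction = predict_unimodal(image_attrs, class_modes)
print(multimodal_prediction, unimodal_prediction)
```

In this toy setup, averaging the two modes of species_a cancels the plumage attributes, so the single-vector baseline picks species_b, while scoring against each mode separately recovers species_a.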

Related Material

@InProceedings{Mall_2022_CVPR,
    author    = {Mall, Utkarsh and Hariharan, Bharath and Bala, Kavita},
    title     = {Zero-Shot Learning Using Multimodal Descriptions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {3931-3939}
}