Zero-Shot Learning Using Multimodal Descriptions
Zero-shot learning (ZSL) tackles the problem of recognizing unseen classes using only semantic descriptions, e.g., attributes. Current zero-shot learning techniques all assume that a single vector of attributes suffices to describe each category. We show that this assumption is incorrect. Many classes in real-world problems have multiple modes of appearance: male and female birds of the same species, for instance, often look different. Domain experts know this and can provide attribute descriptions of the chief modes of appearance for each class. Motivated by this, we propose the task of multimodal zero-shot learning, where the learner must learn from these multimodal attribute descriptions. We present new benchmarks for this task on CUB, SUN, and DeepFashion, and a multimodal ZSL technique that significantly outperforms its unimodal counterpart. Because it allows annotators to provide more than one description per class, we posit that multimodal ZSL is more practical for real-world deployment.
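
To make the task setup concrete, the following is a minimal sketch of how a class with several attribute descriptions might be scored against an image. It is not the technique proposed in the paper: the max-over-modes cosine rule, the function names `cosine` and `predict_class`, the three-attribute vocabulary, and all numeric values are assumptions chosen purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_class(image_attrs, class_modes):
    """Pick the class whose best-matching mode is most similar to the image.

    class_modes maps a class name to a list of attribute vectors,
    one per mode of appearance (hypothetical scoring rule).
    """
    scores = {
        name: max(cosine(image_attrs, mode) for mode in modes)
        for name, modes in class_modes.items()
    }
    return max(scores, key=scores.get)

# Hypothetical attribute order: [red plumage, brown plumage, crest]
class_modes = {
    "cardinal": [
        [1.0, 0.0, 1.0],  # male mode: red, crested
        [0.0, 1.0, 1.0],  # female mode: brown, crested
    ],
    "sparrow": [
        [0.0, 1.0, 0.0],  # single mode: brown, no crest
    ],
}

# Attribute scores predicted for an image of a brown, crested bird
# (made-up values standing in for a learned attribute predictor).
image_attrs = [0.1, 0.9, 0.8]
prediction = predict_class(image_attrs, class_modes)
```

Under these assumed values the image matches the female cardinal description more closely than the sparrow description, so the class is recovered through its second mode. A unimodal setup would have to compress both cardinal descriptions into one vector before any comparison.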