Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding

Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua; The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1881-1889


We address the problem of dense visual-semantic embedding, which maps not only full sentences and whole images but also phrases within sentences and salient regions within images into a shared multimodal embedding space. As a result, we can produce several region-oriented, expressive phrases rather than a single overview sentence to describe an image. In particular, we present a hierarchically structured recurrent neural network (RNN), namely the Hierarchical Multimodal LSTM (HM-LSTM) model. Unlike a chain-structured RNN, our model has a hierarchical structure, so it can naturally build representations for phrases and image regions and further exploit their hierarchical relations. Moreover, the fine-grained correspondences between phrases and image regions can be automatically learned and used to improve the learning of the dense embedding space. Extensive experiments on several datasets validate the efficacy of our proposed method, which compares favorably with state-of-the-art methods.
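The core idea in the abstract is a joint embedding space in which image regions and sentence phrases can be compared directly, with matching pairs scored above non-matching ones. The sketch below illustrates that idea with simple linear projections, cosine similarity, and a hinge-based ranking loss; all dimensions, the margin value, and the use of linear maps (in place of the paper's HM-LSTM encoders) are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and margin (assumptions, not from the paper).
D_IMG, D_TXT, D_EMB = 512, 300, 256
MARGIN = 0.2

# Stand-in encoders: linear projections into the shared embedding space.
# In the paper these would be the region-feature and phrase (HM-LSTM) encoders.
W_img = rng.normal(scale=0.01, size=(D_EMB, D_IMG))
W_txt = rng.normal(scale=0.01, size=(D_EMB, D_TXT))

def embed(W, x):
    """Project a feature vector into the joint space and L2-normalize it."""
    v = W @ x
    return v / (np.linalg.norm(v) + 1e-8)

def similarity(region_feat, phrase_feat):
    """Cosine similarity between a region and a phrase in the joint space."""
    return float(embed(W_img, region_feat) @ embed(W_txt, phrase_feat))

def hinge_ranking_loss(region, pos_phrase, neg_phrase):
    """Encourage the matching phrase to score above a mismatch by MARGIN."""
    return max(0.0, MARGIN
               - similarity(region, pos_phrase)
               + similarity(region, neg_phrase))

# Toy usage: one region feature, one matching and one mismatching phrase.
region = rng.normal(size=D_IMG)
pos = rng.normal(size=D_TXT)
neg = rng.normal(size=D_TXT)
loss = hinge_ranking_loss(region, pos, neg)
```

Minimizing such a loss over many (region, phrase) pairs is what shapes the joint space; the paper's contribution is replacing the stand-in encoders with a hierarchical LSTM so phrase and region representations, and their correspondences, are learned jointly.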

BibTeX

@InProceedings{Niu_2017_ICCV,
  author    = {Niu, Zhenxing and Zhou, Mo and Wang, Le and Gao, Xinbo and Hua, Gang},
  title     = {Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding},
  booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
  month     = {Oct},
  year      = {2017}
}