- [pdf] [code]
Class Concentration with Twin Variational Autoencoders for Unsupervised Cross-modal Hashing
Multi-modal deep hash learning is arguably one of the most commonly used unsupervised methods in cross-modal retrieval tasks. Most existing deep hashing methods focus on maintaining similarity information in the hash code learning step. Although they learn accurate and compact binary representations, these methods do not encourage discriminative feature learning. In this paper, we propose a new method called Class Concentrated Variational auto-encoder (CCTV) to learn discriminative hash codes. The novelty of CCTV lies in two aspects. First, the proposed method concentrates the mean vectors of the latent features. Based on the assumption that the features in the shared latent space follow a multivariate Gaussian distribution, CCTV updates the mean vectors and the cluster centroids of the latent features simultaneously by minimizing the class concentration loss, which narrows the distance between the cluster centroids and the mean vectors and makes each cluster more compact. Second, unlike previous works, CCTV, under the constraints of raw similarity information, uses the mean vector of the latent features as the representation of the images to reduce the influence of variance, and then embeds it in the Hamming space. Our experimental evaluation on four multimedia benchmarks shows a significant improvement over state-of-the-art methods.
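
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the class concentration loss or the binarization step, so everything below is an assumption made for illustration: the loss is taken to be the mean squared Euclidean distance from each latent mean vector to its nearest cluster centroid, and the Hamming embedding is taken to be a simple sign threshold. The function names `class_concentration_loss`, `binarize_means`, and `hamming_distance` are hypothetical and do not come from the paper or its code.

```python
import numpy as np


def class_concentration_loss(mu, centroids):
    """Assumed loss: mean squared distance from each latent mean to its nearest centroid.

    mu:        (N, d) array of latent mean vectors (one per sample).
    centroids: (K, d) array of cluster centroids.
    Returns the scalar loss and the (N,) array of centroid assignments.
    """
    # Pairwise squared Euclidean distances, shape (N, K), via broadcasting.
    diff = mu[:, None, :] - centroids[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Assign each mean vector to its closest centroid.
    assign = np.argmin(sq_dist, axis=1)
    # Average the distance to the assigned centroid over all samples.
    loss = float(np.mean(sq_dist[np.arange(mu.shape[0]), assign]))
    return loss, assign


def binarize_means(mu):
    """Assumed Hamming embedding: sign threshold to {-1, +1} (zero maps to +1)."""
    return np.where(mu >= 0, 1, -1).astype(np.int8)


def hamming_distance(a, b):
    """Number of positions where two binary codes differ."""
    return int(np.sum(a != b))


if __name__ == "__main__":
    # Toy latent means: two samples near each of two centroids.
    mu = np.array([[0.9, 1.1], [1.2, 0.8], [-1.0, -0.9], [-1.1, -1.2]])
    centroids = np.array([[1.0, 1.0], [-1.0, -1.0]])
    loss, assign = class_concentration_loss(mu, centroids)
    codes = binarize_means(mu)
    print("loss:", loss)
    print("assignments:", assign)
    print("codes:", codes.tolist())
```

Using only the mean vector, rather than a sample drawn from the latent Gaussian, makes the code deterministic for a given input, which is one way to read the abstract's remark about reducing the influence of variance. In a trainable version both `mu` and `centroids` would be updated by gradient descent on this loss; the sketch only evaluates it.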