3D Hand Pose Estimation with Disentangled Cross-Modal Latent Space

Jiajun Gu, Zhiyong Wang, Wanli Ouyang, Weichen Zhang, Jiafeng Li, Li Zhuo; The IEEE Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 391-400

Abstract


Estimating 3D hand pose from a single RGB image is a challenging task because of its ill-posed nature (i.e., depth ambiguity). Recently, various generative approaches have been proposed to predict the 3D joints by learning a unified latent space between two modalities (i.e., RGB image and 3D joints). However, projecting multi-modal data (i.e., RGB images and 3D joints) into a unified latent space is difficult, as the modality-specific features usually interfere with the learning of the optimal latent space. Hence, in this paper, we propose to disentangle the latent space into two sub-latent spaces, a modality-specific latent space and a pose-specific latent space, for 3D hand pose estimation. Our proposed method, namely Disentangled Cross-Modal Latent Space (DCMLS), consists of two variational autoencoder networks and auxiliary components which connect the two VAEs to align the underlying hand poses and transfer modality context from RGB to 3D. For the hand pose latent space, we align the pose representations from the two modalities by using a cross-modal discriminator with an adversarial learning strategy. For the context latent space, we learn a context translator to gain access to the cross-modal context. Experimental results on two widely used public benchmark datasets, RHD and STB, demonstrate that our proposed DCMLS method outperforms state-of-the-art methods on single-image-based 3D hand pose estimation.
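As a rough illustration only (not the authors' code, and with all dimensions chosen hypothetically), the core disentangling idea described above can be sketched as follows: each modality's encoder produces a latent vector that is split into a pose-specific part, which a cross-modal discriminator would be trained to make indistinguishable across modalities, and a modality-specific context part, which a learned translator would map from the RGB side to the 3D side. The linear "encoders" below stand in for the paper's VAE encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper:
D_RGB = 128          # stand-in for an RGB image feature vector
D_JOINT = 63         # 21 hand joints x 3 coordinates
D_LAT = 64           # full latent dimension
POSE_DIM = 32        # pose-specific slice of the latent

def encode(x, w):
    """Toy deterministic encoder standing in for a VAE encoder mean."""
    return np.tanh(x @ w)

# Random stand-in encoder weights for the two modalities.
w_rgb = rng.normal(scale=0.1, size=(D_RGB, D_LAT))
w_jnt = rng.normal(scale=0.1, size=(D_JOINT, D_LAT))

x_rgb = rng.normal(size=(D_RGB,))   # toy RGB feature
x_jnt = rng.normal(size=(D_JOINT,)) # toy flattened 3D joints

z_rgb = encode(x_rgb, w_rgb)
z_jnt = encode(x_jnt, w_jnt)

# Disentangle: the first POSE_DIM dimensions are pose-specific,
# the remainder are modality-specific context.
pose_rgb, ctx_rgb = z_rgb[:POSE_DIM], z_rgb[POSE_DIM:]
pose_jnt, ctx_jnt = z_jnt[:POSE_DIM], z_jnt[POSE_DIM:]

# In training, a cross-modal discriminator would push pose_rgb and
# pose_jnt toward the same distribution (adversarial alignment), and a
# context translator would map ctx_rgb toward ctx_jnt. Here we only
# measure the untrained gaps.
pose_gap = np.linalg.norm(pose_rgb - pose_jnt)
ctx_gap = np.linalg.norm(ctx_rgb - ctx_jnt)
```

The split-by-slicing is purely illustrative; the paper's point is that learning the pose and context parts in separate sub-latent spaces keeps modality-specific features from interfering with the shared pose representation.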

Related Material


[pdf]
[bibtex]
@InProceedings{Gu_2020_WACV,
author = {Gu, Jiajun and Wang, Zhiyong and Ouyang, Wanli and Zhang, Weichen and Li, Jiafeng and Zhuo, Li},
title = {3D Hand Pose Estimation with Disentangled Cross-Modal Latent Space},
booktitle = {The IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}
}