Handformer2T: A Lightweight Regression-Based Model for Interacting Hands Pose Estimation From a Single RGB Image

Pengfei Zhang, Deying Kong; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 6248-6257

Abstract


Despite its extensive range of potential applications in virtual reality and augmented reality, 3D hand pose estimation from a single RGB image remains a very challenging problem. Appearance confusion between the two hands and their joints, along with severe hand-hand occlusion and self-occlusion, makes the problem even harder in the scenario of interacting hands. Previous methods tackle this problem at the joint level and generally rely on heatmap-based coordinate prediction. In this paper, we propose a regression-based method that performs joint regression at the hand level, which makes the model much more lightweight and memory efficient. To achieve this, we design a novel Pose Query Enhancer (PQE) module, which takes the coarse joint prediction for each hand and refines it iteratively. The key idea of PQE is to make the regression model focus on the information near the proposed joint predictions by explicitly sampling the feature map. Since the transformer always operates at the hand level, our model remains lightweight and memory friendly even with this module. Experiments on public benchmarks demonstrate that our model achieves state-of-the-art performance with higher throughput, while requiring less memory and time.
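The abstract's core idea, refining coarse joint predictions by sampling the feature map near each current estimate, can be sketched in generic form. The sketch below is an illustration only, not the paper's implementation: the bilinear sampler, the `update_fn` offset predictor, and the fixed iteration count are all assumptions standing in for the learned PQE components.

```python
import numpy as np

def sample_features(feature_map, coords):
    """Bilinearly sample a (C, H, W) feature map at normalized (x, y) coords in [0, 1].

    Stands in for the feature-sampling step of PQE; the real module uses
    learned components rather than this hand-rolled sampler.
    """
    C, H, W = feature_map.shape
    out = np.zeros((len(coords), C))
    for i, (x, y) in enumerate(coords):
        fx, fy = x * (W - 1), y * (H - 1)
        x0, y0 = int(np.floor(fx)), int(np.floor(fy))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
        wx, wy = fx - x0, fy - y0
        out[i] = ((1 - wx) * (1 - wy) * feature_map[:, y0, x0]
                  + wx * (1 - wy) * feature_map[:, y0, x1]
                  + (1 - wx) * wy * feature_map[:, y1, x0]
                  + wx * wy * feature_map[:, y1, x1])
    return out

def refine_joints(feature_map, coarse_joints, update_fn, num_iters=2):
    """Iteratively refine joint estimates from features sampled at the current positions.

    `update_fn` is a hypothetical offset predictor (a learned regressor in the
    actual model) mapping per-joint features to (dx, dy) corrections.
    """
    joints = coarse_joints.copy()
    for _ in range(num_iters):
        feats = sample_features(feature_map, joints)   # features near current estimates
        joints = joints + update_fn(feats)             # apply predicted offsets
    return joints
```

Because refinement operates on one feature vector per joint rather than dense per-joint heatmaps, this style of regression keeps memory usage low, which is the lightweight property the abstract emphasizes.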

Related Material


[bibtex]
@InProceedings{Zhang_2024_WACV,
    author    = {Zhang, Pengfei and Kong, Deying},
    title     = {Handformer2T: A Lightweight Regression-Based Model for Interacting Hands Pose Estimation From a Single RGB Image},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {6248-6257}
}