Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, Nicu Sebe; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 604-613

Abstract


Transformers have been successfully applied to video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. HoT begins by pruning the pose tokens of redundant frames and ends by recovering full-length tokens, so that only a few pose tokens pass through the intermediate transformer blocks, which improves model efficiency. To achieve this effectively, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves both high efficiency and high estimation accuracy compared to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, HoT saves nearly 50% of FLOPs without sacrificing accuracy and nearly 40% of FLOPs with only a 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.
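The prune-then-recover idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: plain k-means stands in for the token pruning cluster, a single softmax cross-attention step stands in for the token recovering attention, and the intermediate transformer blocks are skipped entirely. The function names `prune_tokens` and `recover_tokens`, the toy data, and the positional queries are all hypothetical choices made for illustration.

```python
import math


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prune_tokens(tokens, k, iters=10):
    """Return indices of at most k representative frames.

    Assumption: plain k-means is used here as a stand-in for the
    paper's token pruning cluster (TPC).
    """
    n = len(tokens)
    dim = len(tokens[0])
    # Deterministic initialization from evenly spaced frames.
    centers = [list(tokens[(i * n) // k]) for i in range(k)]

    def assign():
        clusters = [[] for _ in range(k)]
        for idx, tok in enumerate(tokens):
            nearest = min(range(k), key=lambda c: dist2(tok, centers[c]))
            clusters[nearest].append(idx)
        return clusters

    for _ in range(iters):
        for j, members in enumerate(assign()):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [
                    sum(tokens[i][c] for i in members) / len(members)
                    for c in range(dim)
                ]

    # One representative per non-empty cluster: the member nearest its center.
    reps = [
        min(members, key=lambda i: dist2(tokens[i], centers[j]))
        for j, members in enumerate(assign())
        if members
    ]
    return sorted(reps)


def recover_tokens(selected, queries):
    """Expand a few selected tokens to len(queries) tokens.

    Assumption: one softmax cross-attention step (queries attend to the
    selected tokens) stands in for the token recovering attention (TRA).
    """
    scale = math.sqrt(len(selected[0]))
    recovered = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / scale for s in selected]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        recovered.append([
            sum(w / total * s[c] for w, s in zip(weights, selected))
            for c in range(len(selected[0]))
        ])
    return recovered


# Toy input: 8 frames with 2-D features forming two redundant groups.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
reps = prune_tokens(tokens, k=2)
selected = [tokens[i] for i in reps]
# Hypothetical per-frame queries; a trained model would supply its own.
queries = [[math.sin(t), math.cos(t)] for t in range(len(tokens))]
recovered = recover_tokens(selected, queries)
print(reps, len(recovered))
```

On the toy input, pruning keeps one frame from each redundant group, and recovery returns one token per original frame, which mirrors the hourglass shape (full length, few tokens, full length) that the abstract describes.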

Related Material


@InProceedings{Li_2024_CVPR,
    author    = {Li, Wenhao and Liu, Mengyuan and Liu, Hong and Wang, Pichao and Cai, Jialun and Sebe, Nicu},
    title     = {Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {604-613}
}