Detecting Arbitrary Intermediate Keypoints for Human Pose Estimation With Vision Transformers

Katja Ludwig, Philipp Harzig, Rainer Lienhart; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2022, pp. 663-671

Abstract


Most human pose estimation datasets have a fixed set of keypoints. Hence, trained models are only capable of detecting these defined points. Adding new points to the dataset requires a full retraining of the model. We present a model based on the Vision Transformer architecture that can detect these fixed points and arbitrary intermediate points without any computational overhead at inference time. Furthermore, independently detected intermediate keypoints can improve analyses derived from the keypoints, such as the calculation of body angles. Our approach is based on TokenPose and replaces the fixed keypoint tokens with tokens obtained by embedding human-readable keypoint vectors. For ski jumpers, who benefit from intermediate detections, especially on their skis, this model achieves the same performance as TokenPose on the fixed keypoints and can detect any intermediate keypoint directly.
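
The idea of embedding a human-readable keypoint vector into a keypoint token can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything here is an assumption made for illustration: a fixed keypoint is written as a one-hot vector, an intermediate keypoint as a blend of two fixed keypoints, and the embedding is a single random linear map. The names `keypoint_vector`, `make_embedding`, `embed`, `NUM_KEYPOINTS`, and `TOKEN_DIM` are hypothetical and not taken from the paper or from TokenPose.

```python
import random

NUM_KEYPOINTS = 4   # hypothetical number of fixed keypoints
TOKEN_DIM = 8       # hypothetical token width


def keypoint_vector(idx_a, idx_b=None, t=0.0, n=NUM_KEYPOINTS):
    """Assumed encoding: one-hot for a fixed keypoint, or weights
    (1 - t, t) on two fixed keypoints for an intermediate point."""
    vec = [0.0] * n
    if idx_b is None:
        vec[idx_a] = 1.0
    else:
        vec[idx_a] += 1.0 - t
        vec[idx_b] += t
    return vec


def make_embedding(n=NUM_KEYPOINTS, dim=TOKEN_DIM, seed=0):
    """Random n x dim matrix standing in for a learned embedding layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def embed(vec, weights):
    """Linear embedding: token = vec @ weights."""
    dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(dim)]


weights = make_embedding()
fixed = keypoint_vector(0)               # fixed keypoint 0
midpoint = keypoint_vector(0, 1, t=0.5)  # halfway between keypoints 0 and 1
token = embed(midpoint, weights)         # token that would be fed to the transformer
```

In this sketch, querying a new intermediate point only changes the input vector, not the embedding weights, which is one way to read the abstract's claim that arbitrary intermediate points can be requested without retraining.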

Related Material


[bibtex]
@InProceedings{Ludwig_2022_WACV,
    author    = {Ludwig, Katja and Harzig, Philipp and Lienhart, Rainer},
    title     = {Detecting Arbitrary Intermediate Keypoints for Human Pose Estimation With Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2022},
    pages     = {663-671}
}