Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation
Abstract
3D human pose estimation (3D HPE) is an important computer vision task with various practical applications. However, estimating the 3D poses of multiple persons from a monocular video (3DMPPE) is particularly challenging. Recent transformer-based approaches focus on capturing spatial-temporal information from sequences of 2D poses, which unfortunately discards the visual features relevant to 3D pose estimation. In this paper, we propose an end-to-end framework called Event-Guided Video Transformer (EVT), which predicts 3D poses directly from video frames by effectively learning spatial-temporal contextual information from visual features. In addition, our design is the first to incorporate event features to help guide 3D pose estimation. EVT first decouples persons into separate instance-aware feature maps extracted from video frames. These features, which contain specific cues about body structure, are fed together with event features into an attention-based Event-Aware Embedding Module. The fused features for each instance are then passed through an intra-human relation extraction module and subsequently through a temporal transformer that extracts inter-frame relationships. Finally, the extracted features are fed into a decoder for 3D pose estimation. Experiments on three widely used 3D pose estimation benchmarks show that our proposed EVT outperforms state-of-the-art models.
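For readers who want a concrete picture of the pipeline, the PyTorch sketch below mirrors the stages named in the abstract: per-instance visual tokens are fused with event tokens by cross-attention (standing in for the Event-Aware Embedding Module), passed through a spatial transformer for intra-human relations, pooled per frame, run through a temporal transformer for inter-frame relations, and decoded into per-joint 3D coordinates. Every module name, fusion mechanism, tensor shape, and hyperparameter here is an illustrative assumption; the abstract does not specify the architecture at this level of detail.

# Minimal, runnable sketch of the pipeline described in the abstract.
# All module names, shapes, and hyperparameters are illustrative
# assumptions, not the paper's actual EVT architecture.
import torch
import torch.nn as nn


class EventAwareEmbedding(nn.Module):
    """Fuse per-instance visual tokens with event tokens via cross-attention
    (an assumed mechanism for the paper's Event-Aware Embedding Module)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, inst_tokens, event_tokens):
        # inst_tokens: (B*T, N, d) visual tokens; event_tokens: (B*T, M, d)
        fused, _ = self.cross_attn(inst_tokens, event_tokens, event_tokens)
        return self.norm(inst_tokens + fused)


class EVTSketch(nn.Module):
    """Instance features -> event fusion -> intra-human (spatial) transformer
    -> temporal transformer -> 3D pose decoder."""
    def __init__(self, d_model=256, n_heads=8, n_joints=17):
        super().__init__()
        self.event_fusion = EventAwareEmbedding(d_model, n_heads)
        spatial = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.intra_human = nn.TransformerEncoder(spatial, num_layers=2)
        temporal = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.decoder = nn.Linear(d_model, n_joints * 3)  # (x, y, z) per joint
        self.n_joints = n_joints

    def forward(self, inst_feats, event_feats):
        # inst_feats:  (B, T, N, d) per-instance visual tokens per frame
        # event_feats: (B, T, M, d) event tokens per frame
        B, T, N, d = inst_feats.shape
        x = self.event_fusion(inst_feats.flatten(0, 1), event_feats.flatten(0, 1))
        x = self.intra_human(x)          # relations within one person per frame
        x = x.mean(dim=1).view(B, T, d)  # pool tokens -> one token per frame
        x = self.temporal(x)             # inter-frame relations
        return self.decoder(x).view(B, T, self.n_joints, 3)


if __name__ == "__main__":
    model = EVTSketch()
    poses = model(torch.randn(2, 8, 16, 256), torch.randn(2, 8, 16, 256))
    print(poses.shape)  # torch.Size([2, 8, 17, 3])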
Related Material
[pdf]
[bibtex]
@InProceedings{Lang_2025_WACV,
    author    = {Lang, Bo and Chuah, Mooi Choo},
    title     = {Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5114-5124}
}