Feature Alignment and Compositional Token for Human Pose Estimation

Po-Chi Hsu, Ming-Han Lee, Kun-Ru Wu, Yu-Chee Tseng; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2026, pp. 115-124

Abstract


Video-based 2D human pose estimation in real-world surveillance often suffers from motion blur, severe self-occlusion, and body parts moving outside camera boundaries. Existing heatmap-based methods exploit temporal cues but still fail to maintain structural consistency, resulting in unstable or unrealistic pose predictions under degraded visual conditions. We present FACT-Pose, a robust video pose estimation framework that combines feature alignment with compositional token representations. The feature alignment module performs global and local spatiotemporal alignment to extract reliable motion cues while suppressing noise caused by blur or rapid movement. The compositional token representation further models human substructures through learnable tokens, enabling the recovery of occluded or blurred joints and eliminating quantization errors inherent to heatmap-based outputs. To better reflect real surveillance environments, we introduce Human3.6M-OB (Occluded and Blurred), an augmented benchmark with sensor-accurate annotations and real-world challenges including motion blur and boundary truncation. Experiments show that FACT-Pose achieves more stable and accurate predictions across all challenging conditions, demonstrating strong potential for deployment in practical surveillance scenarios.
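The quantization error mentioned above can be illustrated with a minimal sketch. It is not the FACT-Pose implementation. It assumes a generic heatmap pipeline in which a joint is rendered as a Gaussian on a grid downsampled by a fixed stride and decoded by taking the hottest cell. The function names, image size, stride, and joint position are all hypothetical values chosen for illustration.

```python
import math

def render_heatmap(x, y, width, height, stride, sigma=1.0):
    """Render a Gaussian heatmap at 1/stride resolution for a joint at image coords (x, y)."""
    w, h = width // stride, height // stride
    cx, cy = x / stride, y / stride  # joint position in heatmap coordinates (fractional)
    return [
        [math.exp(-((i - cx) ** 2 + (j - cy) ** 2) / (2 * sigma ** 2)) for i in range(w)]
        for j in range(h)
    ]

def decode_argmax(heatmap, stride):
    """Decode by picking the hottest cell and scaling its index back to image coordinates."""
    _, i, j = max(
        (value, i, j)
        for j, row in enumerate(heatmap)
        for i, value in enumerate(row)
    )
    return i * stride, j * stride

# Hypothetical joint location that does not fall on a multiple of the stride.
true_xy = (101.0, 57.0)
stride = 4

heatmap = render_heatmap(*true_xy, width=256, height=192, stride=stride)
pred_xy = decode_argmax(heatmap, stride)

# The decoded location snaps to the stride grid, so a residual error remains
# even though the heatmap itself is noise-free.
error = math.dist(true_xy, pred_xy)
print(f"true={true_xy} decoded={pred_xy} error={error:.3f} px")
```

Under these assumptions the decoded coordinate can only land on multiples of the stride, so the error per axis can reach half the stride even with a perfect heatmap. Methods that output coordinates directly, such as through token-based representations, are not bound to that grid.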

Related Material


@InProceedings{Hsu_2026_WACV,
    author    = {Hsu, Po-Chi and Lee, Ming-Han and Wu, Kun-Ru and Tseng, Yu-Chee},
    title     = {Feature Alignment and Compositional Token for Human Pose Estimation},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {March},
    year      = {2026},
    pages     = {115-124}
}