SV-data2vec: Guiding Video Representation Learning with Latent Skeleton Targets

Zorana Doždor, Tomislav Hrkac, Zoran Kalafatic; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6967-6976

Abstract


Recent advancements in action recognition leverage both skeleton and video modalities to achieve state-of-the-art performance. However, due to the challenges of early fusion, which tends to underutilize the strengths of each modality, existing methods often resort to late fusion, consequently leading to more complex designs. Additionally, self-supervised learning approaches utilizing both modalities remain underexplored. In this paper, we introduce SV-data2vec, a novel self-supervised framework for learning from skeleton and video data. Our approach employs a student-teacher architecture in which the teacher network generates contextualized targets based on skeleton data. The student network performs a masked prediction task using both skeleton and visual data. Remarkably, after pretraining with both modalities, our method allows for fine-tuning with RGB data alone, achieving results on par with multimodal approaches by effectively learning video representations through skeleton-data guidance. Extensive experiments on the benchmark datasets NTU RGB+D 60, NTU RGB+D 120, and Toyota Smarthome confirm that our method outperforms existing RGB-based state-of-the-art techniques. The code is available at github.com/zoranadozdor/SVdata2vec.
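The abstract describes a data2vec-style student-teacher scheme: a teacher (typically an exponential moving average of the student) encodes the full skeleton sequence into contextualized latent targets, and the student regresses those targets from masked input. The sketch below illustrates only this general mechanism; the toy linear "encoder", the fusion of video tokens into masked positions, and all names and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a data2vec-style student-teacher masked prediction step.
# The linear per-frame "encoder" stands in for a transformer; the way video
# tokens are mixed into masked skeleton frames is an assumption for illustration.
import random

DIM = 4          # toy per-frame feature dimension
EMA_DECAY = 0.99 # teacher weights track student weights via EMA

def encode(weights, frames):
    """Toy 'encoder': per-frame linear map (stand-in for a transformer)."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weights]
            for f in frames]

def ema_update(teacher_w, student_w, decay=EMA_DECAY):
    """Teacher parameters are an exponential moving average of the student's."""
    for i, row in enumerate(student_w):
        for j, w in enumerate(row):
            teacher_w[i][j] = decay * teacher_w[i][j] + (1 - decay) * w

def masked_regression_loss(student_w, teacher_w, skeleton, video, mask):
    """Teacher builds latent targets from the unmasked skeleton sequence;
    the student predicts them from masked skeleton plus video tokens."""
    targets = encode(teacher_w, skeleton)  # contextualized latent targets
    # Student input: video token per frame, plus the skeleton frame only
    # where it is not masked (masked frames rely on video context alone).
    student_in = [[v_j + (0.0 if m else s_j) for s_j, v_j in zip(s, v)]
                  for s, v, m in zip(skeleton, video, mask)]
    preds = encode(student_w, student_in)
    # Mean-squared error, computed only on masked positions.
    terms = [(p - t) ** 2
             for pr, tg, m in zip(preds, targets, mask) if m
             for p, t in zip(pr, tg)]
    return sum(terms) / max(len(terms), 1)
```

After each student gradient step, `ema_update` pulls the teacher slowly toward the student, so the targets stay stable while still improving over training, which is the standard data2vec recipe the method's name alludes to.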

Related Material


[bibtex]
@InProceedings{Dozdor_2025_WACV,
    author    = {Do\v{z}dor, Zorana and Hrkac, Tomislav and Kalafatic, Zoran},
    title     = {SV-data2vec: Guiding Video Representation Learning with Latent Skeleton Targets},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6967-6976}
}