Self-Supervised Video Interaction Classification Using Image Representation of Skeleton Data
Recognizing interactions in broadcast videos of sports games is an application of Interaction Recognition from Videos (IRV) that poses many challenges, because the interactions are complex and are often recorded from a suboptimal viewpoint. Annotating large-scale, sport-specific datasets is expensive and time-consuming. In this study, we therefore demonstrate the effectiveness of Self-Supervised Learning (SSL) methods for building useful representations from human skeleton pose data (pose for short) without requiring costly annotations for a large-scale dataset. Given the numerous well-established image-based SSL methods, we show how to adapt them to sequences of pose through a data transformation and a series of pose-based augmentations. Specifically, we adapt Relational Reasoning SSL (Relational-SSL for short) and achieve 68.18 ± 0% and 76.62 ± 2.7% under the linear evaluation and fine-tuning protocols, respectively, on the downstream task of IRV from sports broadcast videos. Lastly, we run ablation studies on different components of the method, including the effect of using estimated pose (versus ground-truth pose) on downstream performance.
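To illustrate the kind of data transformation that lets an image-based SSL method consume pose sequences, the following is a minimal sketch and not the encoding used in this study. It assumes 2D joint coordinates, places joints on rows and frames on columns, and maps x and y to the red and green channels after min-max scaling over the sequence. The function name `pose_sequence_to_image` and the channel layout are hypothetical choices made for this example only.

```python
from typing import List, Tuple

Pose = List[Tuple[float, float]]        # one frame: an (x, y) pair per joint
Pixel = Tuple[int, int, int]            # (R, G, B), each in 0..255


def pose_sequence_to_image(seq: List[Pose]) -> List[List[Pixel]]:
    """Encode a pose sequence as a J x T RGB image (rows: joints, columns: frames).

    Assumed encoding (hypothetical, for illustration): x -> red, y -> green,
    blue is left at 0. Each coordinate is min-max scaled over the whole
    sequence so that it fills the 0..255 range.
    """
    if not seq or not seq[0]:
        raise ValueError("sequence needs at least one frame and one joint")
    n_joints = len(seq[0])
    if any(len(frame) != n_joints for frame in seq):
        raise ValueError("every frame must contain the same number of joints")

    xs = [x for frame in seq for x, _ in frame]
    ys = [y for frame in seq for _, y in frame]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def scale(value: float, lo: float, hi: float) -> int:
        # A constant coordinate carries no information; map it to 0.
        if hi == lo:
            return 0
        return int(255 * (value - lo) / (hi - lo))

    return [
        [
            (scale(seq[t][j][0], x_lo, x_hi), scale(seq[t][j][1], y_lo, y_hi), 0)
            for t in range(len(seq))
        ]
        for j in range(n_joints)
    ]


if __name__ == "__main__":
    # Two frames, two joints: joint 0 moves, joint 1 stays in place.
    sequence = [
        [(0.0, 0.0), (2.0, 4.0)],
        [(2.0, 4.0), (2.0, 4.0)],
    ]
    for row in pose_sequence_to_image(sequence):
        print(row)
```

Once a sequence is in this form, standard image pipelines can process it, while augmentations can still be applied in pose space (for example, mirroring the x coordinates) before the encoding step.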