Skew-Robust Human-Object Interactions in Videos

Apoorva Agarwal, Rishabh Dabral, Arjun Jain, Ganesh Ramakrishnan; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5098-5107


Humans are, arguably, one of the most important regions of interest in a visual analysis pipeline. Detecting how a human interacts with the surrounding environment thus becomes an important problem with several potential use cases. While this has been adequately addressed in the literature in the image setting, very few methods address in-the-wild videos. The problem is further exacerbated by the high degree of label skew. To this end, we propose SeRVo-HOI, a robust end-to-end framework for recognizing human-object interactions from a video, particularly in high label-skew settings. The network contextualizes multiple image representations and is trained to explicitly handle dataset skew. We propose and analyse methods to address the long-tail distribution of the labels and show improvements on the tail labels. SeRVo-HOI outperforms the state of the art by a significant margin (21.1% vs 17.6% mAP) on the large-scale, in-the-wild VidHOI dataset, while demonstrating solid improvements on the tail classes in particular (19.9% vs 17.3% mAP).

Related Material

@InProceedings{Agarwal_2023_WACV,
    author    = {Agarwal, Apoorva and Dabral, Rishabh and Jain, Arjun and Ramakrishnan, Ganesh},
    title     = {Skew-Robust Human-Object Interactions in Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {5098-5107}
}