Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1845-1855

Abstract


We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.
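As a rough illustration of the idea of associating spatiotemporal regions with a vocabulary, the following is a minimal sketch of a generic InfoNCE-style contrastive loss with max-pooling over regions. It is not the loss defined in the paper: the function names, the dot-product similarity, the max-over-regions pooling, and the `temperature` parameter are all assumptions made for this example, and the temporal-continuity term mentioned in the abstract is not modeled.

```python
import math

def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def video_word_score(regions, word):
    """Score a word against a video as its best-matching region.

    Taking the max over regions is an assumption for this sketch: with only
    video-level labels, we do not know which region depicts the word.
    """
    return max(dot(r, word) for r in regions)

def contrastive_loss(regions, vocab, positive_index, temperature=0.1):
    """InfoNCE-style loss (hypothetical, not the paper's exact formulation).

    Pushes the video's score for the labeled word above its scores for every
    other vocabulary word, using a softmax over the vocabulary.
    """
    scores = [video_word_score(regions, w) / temperature for w in vocab]
    # Numerically stable log-sum-exp over the vocabulary scores.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[positive_index]

# Toy data: two region features and three word embeddings (all made up).
regions = [[1.0, 0.0], [0.0, 1.0]]
vocab = [[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]

loss_match = contrastive_loss(regions, vocab, positive_index=0)
loss_mismatch = contrastive_loss(regions, vocab, positive_index=1)
print(loss_match < loss_mismatch)
```

In this toy setup the first word aligns with one of the regions, so labeling the video with it yields a much smaller loss than labeling it with a word that matches no region.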

Related Material


[bibtex]
@InProceedings{Li_2021_ICCV,
    author    = {Li, Shuang and Du, Yilun and Torralba, Antonio and Sivic, Josef and Russell, Bryan},
    title     = {Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1845-1855}
}