Detecting Human-Object Relationships in Videos

Jingwei Ji, Rishi Desai, Juan Carlos Niebles; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8106-8116


We study a crucial problem in video analysis: human-object relationship detection. Most previous approaches were developed only for static images and do not model the temporal dynamics that are vital for contextualizing human-object relationships. We propose a model with Intra- and Inter-Transformers that enables joint spatial and temporal reasoning over multiple visual concepts: objects, relationships, and human poses. We find that applying attention among spatio-temporally distributed features greatly improves human-object relationship detection. We validate our method on two datasets, Action Genome and CAD-120-EVAR, and it achieves state-of-the-art performance on both.
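The abstract does not spell out the internals of the Intra- and Inter-Transformers. As a rough, non-authoritative illustration of how attention can provide joint spatial and temporal reasoning, the sketch below applies plain scaled dot-product self-attention in two stages. It first attends among the tokens of each frame and then attends across frames for each token slot. This is a generic sketch, not the authors' model. The function names are hypothetical, the query/key/value projections are the identity (single head, no learned weights), and token slot `n` is assumed to hold the same tracked entity (for example, one human or one object) in every frame.

```python
import math
from typing import List

Vector = List[float]
Frame = List[Vector]  # N token feature vectors for one frame


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Simplifying assumption: Q, K and V are the tokens themselves
    (identity projections), so nothing here is learned.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def intra_frame(video: List[Frame]) -> List[Frame]:
    # Hypothetical "intra" stage: tokens attend only to tokens in the same frame.
    return [self_attention(frame) for frame in video]


def inter_frame(video: List[Frame]) -> List[Frame]:
    # Hypothetical "inter" stage: each token slot attends across all frames.
    # Assumes slot n refers to the same entity in every frame.
    num_slots = len(video[0])
    per_slot = [self_attention([frame[n] for frame in video]) for n in range(num_slots)]
    return [[per_slot[n][t] for n in range(num_slots)] for t in range(len(video))]


if __name__ == "__main__":
    # Toy input: T=3 frames, N=2 token slots, D=4 feature dimensions.
    video = [[[float(t + n), float(t - n), 1.0, 0.5] for n in range(2)] for t in range(3)]
    fused = inter_frame(intra_frame(video))
    print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 2 4
```

Running the spatial stage before the temporal stage is one arbitrary choice. The two stages could also be swapped, interleaved over several layers, or replaced by a single attention over all T x N tokens at a higher cost.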

@InProceedings{Ji_2021_ICCV,
    author    = {Ji, Jingwei and Desai, Rishi and Niebles, Juan Carlos},
    title     = {Detecting Human-Object Relationships in Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {8106-8116}
}