What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions

A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5353-5363

Abstract


We propose SSRT, a novel one-stage semantic and spatial refined Transformer for the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
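To make the abstract's idea of query refinement concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of fusing auxiliary semantic features (e.g. embeddings of candidate object-action pairs) and spatial features (e.g. paired-box geometry) into Transformer decoder queries via cross-attention. All module names, feature dimensions, and the choice of nn.MultiheadAttention are assumptions made for illustration only.

# Hypothetical sketch of semantic/spatial query refinement for HOI detection.
# Not the SSRT code; dimensions and design choices are illustrative assumptions.
import torch
import torch.nn as nn


class QueryRefiner(nn.Module):
    """Fuses semantic and spatial support features into HOI decoder queries."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Project semantic (e.g. 300-d label embeddings) and spatial
        # (e.g. 8-d paired-box geometry) features into the query space.
        self.semantic_proj = nn.Linear(300, d_model)
        self.spatial_proj = nn.Linear(8, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, semantic_feats, spatial_feats):
        # queries:        (B, num_queries, d_model) decoder query embeddings
        # semantic_feats: (B, num_pairs, 300)       candidate object-action pair features
        # spatial_feats:  (B, num_pairs, 8)         geometric features for the same pairs
        support = self.semantic_proj(semantic_feats) + self.spatial_proj(spatial_feats)
        refined, _ = self.cross_attn(query=queries, key=support, value=support)
        return self.norm(queries + refined)          # residual connection


if __name__ == "__main__":
    refiner = QueryRefiner()
    q = torch.randn(2, 100, 256)
    sem = torch.randn(2, 16, 300)
    spa = torch.randn(2, 16, 8)
    print(refiner(q, sem, spa).shape)  # torch.Size([2, 100, 256])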

Related Material


@InProceedings{Iftekhar_2022_CVPR,
  author    = {Iftekhar, A S M and Chen, Hao and Kundu, Kaustav and Li, Xinyu and Tighe, Joseph and Modolo, Davide},
  title     = {What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {5353-5363}
}