@InProceedings{Iftekhar_2022_CVPR,
    author    = {Iftekhar, A S M and Chen, Hao and Kundu, Kaustav and Li, Xinyu and Tighe, Joseph and Modolo, Davide},
    title     = {What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {5353-5363}
}
What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions
Abstract
We propose the Semantic and Spatial Refined Transformer (SSRT), a novel one-stage Transformer-based model for Human-Object Interaction (HOI) detection, a task that requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.