Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions as <human, action, object> triplets. Recent advances in pre-trained vision-language models (VLMs) have improved zero-shot HOI detection, enabling the identification of unseen triplets. However, existing methods leverage the VLM only as an additional encoder for interaction prediction, not for human/object detection. This limitation hinders their ability to detect unseen objects. Furthermore, the additional encoder increases both model size and computational cost. This paper proposes a novel HOI detection framework, ECI-HOI, which unleashes the potential of the pre-trained VLM for zero-shot HOI detection by leveraging it for both sub-tasks. We first employ CLIP as a single image encoder, reducing redundancy in the network architecture. In addition, we propose an instance selector and an HO pair decoder to effectively harmonize human/object detection and interaction prediction in a zero-shot manner. We evaluate our model under various settings on HICO-DET and on our two new test sets: an out-of-distribution image test set and a novel-object test set. Our model outperforms state-of-the-art models while reducing model size by more than 50%, notably achieving a +10.01 mAP improvement under the unseen-object setting on HICO-DET. The results on the proposed datasets highlight the zero-shot performance of our model in more challenging settings.
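The abstract describes the design only at a high level. As a rough illustration of the shared-encoder idea, the following is a minimal PyTorch sketch, not the authors' code: a single pre-trained image encoder (CLIP in the paper; a random stub here) feeds both human/object instance selection and HO-pair interaction decoding, instead of pairing a detector backbone with a separate VLM encoder. The names SharedImageEncoder, InstanceSelector, and HOPairDecoder and all of their internals are hypothetical stand-ins for the components the abstract mentions; the actual designs are specified in the paper.

import torch
import torch.nn as nn

class SharedImageEncoder(nn.Module):
    """Stand-in for the pre-trained CLIP image encoder (shared by both sub-tasks)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style patchify

    def forward(self, images):                                    # (B, 3, H, W)
        return self.proj(images).flatten(2).transpose(1, 2)       # (B, N, dim) patch tokens

class InstanceSelector(nn.Module):
    """Hypothetical head: scores patch tokens to pick human/object candidates."""
    def __init__(self, dim=256, num_queries=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.num_queries = num_queries

    def forward(self, tokens):                                    # (B, N, dim)
        s = self.score(tokens).squeeze(-1)                        # (B, N) instance scores
        idx = s.topk(self.num_queries, dim=1).indices             # keep top-scoring tokens
        return torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )                                                         # (B, Q, dim)

class HOPairDecoder(nn.Module):
    """Hypothetical decoder: attends selected instances over the shared image
    tokens to produce per-pair embeddings for interaction classification."""
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, instances, tokens):
        return self.decoder(instances, tokens)                    # (B, Q, dim)

encoder, selector, decoder = SharedImageEncoder(), InstanceSelector(), HOPairDecoder()
tokens = encoder(torch.randn(2, 3, 224, 224))   # one shared encoding pass
pairs = decoder(selector(tokens), tokens)       # same tokens reused for both sub-tasks
print(pairs.shape)                              # torch.Size([2, 32, 256])

Because detection and interaction prediction read from the same encoder output, the sketch has one backbone forward pass and no second encoder, which is the redundancy reduction the abstract claims.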
Related Material

[pdf] [supp]

BibTeX:

@InProceedings{Yamada_2025_WACV,
    author    = {Yamada, Moyuru and Dharamshi, Nimish and Kohli, Ayushi and Kasu, Prasad and Khan, Ainulla and Ghulyani, Manu},
    title     = {Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5751-5760}
}