Exploiting CLIP for Zero-Shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Bo Wan, Tinne Tuytelaars; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 1805-1815

Abstract


In this paper, we investigate zero-shot human-object interaction (HOI) detection, a novel paradigm for identifying HOIs without task-specific annotations. To address this challenging task, we employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation at multiple levels. To this end, we design a multi-branch neural network that leverages CLIP to learn HOI representations at various levels: the global image, local union regions encompassing human-object pairs, and individual instances of humans or objects. During training, CLIP generates HOI scores for both global images and local union regions, which serve as supervision signals. Extensive experiments demonstrate the effectiveness of our multi-level CLIP knowledge integration strategy. Notably, the model achieves strong performance on the public HICO-DET benchmark, comparable even to some fully-supervised and weakly-supervised methods.
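To make the supervision signal concrete, the sketch below (an illustration under stated assumptions, not the authors' released code) shows how CLIP can turn a global image or a cropped human-object union region into soft HOI scores by matching its embedding against text prompts for each interaction class. The three-class HOI list, prompt template, and file name are hypothetical; HICO-DET actually defines 600 verb-object categories.

```python
# Minimal sketch: CLIP-derived HOI scores as distillation targets.
# Uses OpenAI's clip package (https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical subset of HOI classes for illustration only.
hoi_classes = ["ride bicycle", "hold umbrella", "feed horse"]
prompts = clip.tokenize(
    [f"a photo of a person {hoi}" for hoi in hoi_classes]
).to(device)

def clip_hoi_scores(region: Image.Image) -> torch.Tensor:
    """Score a global image or a union-region crop against HOI prompts."""
    image = preprocess(region).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # Cosine similarities, softmax-normalized into soft HOI scores.
        return (100.0 * image_feat @ text_feat.T).softmax(dim=-1).squeeze(0)

scores = clip_hoi_scores(Image.open("example.jpg"))  # hypothetical input
```

In a distillation setup of the kind the abstract describes, such scores would serve as soft targets for the student branches (global, union-region, and instance-level), rather than ground-truth labels.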

Related Material


@InProceedings{Wan_2024_WACV,
    author    = {Wan, Bo and Tuytelaars, Tinne},
    title     = {Exploiting CLIP for Zero-Shot HOI Detection Requires Knowledge Distillation at Multiple Levels},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {1805-1815}
}