Unleash the Potential of CLIP for Video Highlight Detection

Donghoon Han, Seunghyeon Seo, Eunhwan Park, Seong-Uk Nam, Nojun Kwak; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 8275-8279

Abstract


Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potential across various tasks and applications. Among these domains, video has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved, to the best of our knowledge, state-of-the-art performance on the highlight detection task of the QVHighlight benchmark.
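The abstract describes the approach only at a high level: per-frame CLIP features are compared against a text query, and a saliency pooling step aggregates the result. The pooling operation itself is not spelled out here, so the following is a minimal PyTorch sketch under assumed shapes; the function name `saliency_pooling`, the cosine-similarity scoring, and the softmax weighting are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def saliency_pooling(frame_emb: torch.Tensor, query_emb: torch.Tensor):
    """Query-conditioned saliency pooling over per-frame CLIP features.

    frame_emb: (T, D) frame embeddings from the CLIP image encoder.
    query_emb: (D,)   text-query embedding from the CLIP text encoder.
    Returns (scores, pooled): per-frame saliency scores of shape (T,)
    and a saliency-weighted clip-level embedding of shape (D,).
    """
    frame_emb = F.normalize(frame_emb, dim=-1)   # unit-normalize frames
    query_emb = F.normalize(query_emb, dim=-1)   # unit-normalize query
    scores = frame_emb @ query_emb               # cosine similarity, (T,)
    weights = scores.softmax(dim=0)              # attention-style weights, (T,)
    pooled = (weights.unsqueeze(-1) * frame_emb).sum(dim=0)  # (D,)
    return scores, pooled

# Toy usage: 16 frames, embedding size 512 (CLIP ViT-B/32, assumed).
scores, pooled = saliency_pooling(torch.randn(16, 512), torch.randn(512))
```

In this reading, the scores would serve directly as frame-level highlight predictions, while the pooled vector gives a single query-aware summary of the clip.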

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Han_2024_CVPR,
    author    = {Han, Donghoon and Seo, Seunghyeon and Park, Eunhwan and Nam, Seong-Uk and Kwak, Nojun},
    title     = {Unleash the Potential of CLIP for Video Highlight Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {8275-8279}
}