Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning

Ibn Abdul Hakim, Zaber; Sarker, Najibul Haque; Singh, Rahul Pratap; Paul, Bishmoy; Dabouei, Ali; Xu, Min

Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Rahul Pratap Singh, Bishmoy Paul, Ali Dabouei, Min Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1975-1985

Abstract

A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that current models do not achieve a comprehensive understanding of the textual data during training for the target downstream tasks. Orthogonal to previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Rigorous quantitative analysis shows that our proposed method yields significant improvements across several video-language tasks. In particular, our approach improves video-text retrieval, with relative gains of 8.3% in video-to-text and 1.4% in text-to-video retrieval over the baselines in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0% to 13.7% across different baselines.

Related Material

@InProceedings{Ibn_Abdul_Hakim_2024_CVPR,
    author    = {Ibn Abdul Hakim, Zaber and Sarker, Najibul Haque and Singh, Rahul Pratap and Paul, Bishmoy and Dabouei, Ali and Xu, Min},
    title     = {Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1975-1985}
}