VidLA: Video-Language Alignment at Scale

Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14043-14055

Abstract


In this paper, we propose VidLA, an approach for video-language alignment at scale. Previous video-language alignment approaches have two major limitations. First, they do not capture both short-range and long-range temporal dependencies, and they typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome this limitation, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets, which contain only short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
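Since the abstract only sketches the design, the PyTorch snippet below is a minimal illustration of the two ideas it mentions, not the authors' implementation: a simple two-tower model whose towers can be initialized from a pretrained image-text encoder pair (e.g., a CLIP-style vision and text tower), plus extra tokens pooled at several temporal resolutions to reflect the temporally hierarchical nature of videos. All names, scales, and shapes here (HierarchicalTemporalTokens, TwoTowerVideoText, scales=(1, 4, 16)) are hypothetical choices made for the sketch.

# Minimal sketch (assumed, not the authors' code) of a two-tower video-text model
# with multi-temporal-resolution tokens.
import torch
import torch.nn as nn


class HierarchicalTemporalTokens(nn.Module):
    """Pools per-frame features at progressively coarser temporal scales."""

    def __init__(self, dim, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        self.token_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in scales)

    def forward(self, frame_feats):  # frame_feats: (B, T, D)
        tokens = []
        for scale, proj in zip(self.scales, self.token_proj):
            # Average-pool over windows of `scale` frames -> coarser temporal resolution.
            pooled = nn.functional.avg_pool1d(
                frame_feats.transpose(1, 2), kernel_size=scale, stride=scale
            ).transpose(1, 2)  # (B, T // scale, D)
            tokens.append(proj(pooled))
        return torch.cat(tokens, dim=1)  # multi-resolution token set


class TwoTowerVideoText(nn.Module):
    """Two-tower model; both towers can be initialized from a pretrained image-text model."""

    def __init__(self, image_encoder, text_encoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder  # e.g., a pretrained CLIP vision tower
        self.text_encoder = text_encoder    # e.g., the matching CLIP text tower
        self.temporal_tokens = HierarchicalTemporalTokens(dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def encode_video(self, frames):  # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.image_encoder(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        tokens = self.temporal_tokens(feats)
        return nn.functional.normalize(tokens.mean(dim=1), dim=-1)

    def forward(self, frames, text_ids):
        v = self.encode_video(frames)
        t = nn.functional.normalize(self.text_encoder(text_ids), dim=-1)
        return self.logit_scale.exp() * v @ t.t()  # similarity logits for a contrastive loss

In this sketch, the coarse scales summarize long-range temporal context while the fine scale keeps short-range detail; the actual VidLA token design, dataset curation, and training objective are described in the paper.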

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Rizve_2024_CVPR,
  author    = {Rizve, Mamshad Nayeem and Fei, Fan and Unnikrishnan, Jayakrishnan and Tran, Son and Yao, Benjamin Z. and Zeng, Belinda and Shah, Mubarak and Chilimbi, Trishul},
  title     = {VidLA: Video-Language Alignment at Scale},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {14043-14055}
}