SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling

Ju-Hee Lee, Je-Won Kang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13689-13699

Abstract


In recent years, large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in relevant tasks. In this paper, we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories, identifying action-related regions within videos to overcome the limitations of existing temporal attention mechanisms. Additionally, our text encoder incorporates high-level, action-related language knowledge that has been underutilized in current VidLP models. SRL captures action verbs and the related semantics among objects in sentences, enhancing instance-level text matching and thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks on various downstream tasks and datasets, establishing our model as a baseline for the modern VidLP framework.
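To make the "tube feature" idea concrete, the sketch below shows one plausible reading of the abstract: region features are pooled per frame along a tracked object box and aggregated over time into a single action-centric token. This is not the authors' implementation; the function name, shapes, RoI size, and the mean-pooling temporal aggregation are all illustrative assumptions.

```python
import torch
import torchvision.ops as ops


def tube_feature(frame_maps: torch.Tensor, trajectory: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of a video tube feature (assumed, not the paper's code).

    frame_maps: (T, C, H, W) backbone feature maps for T frames.
    trajectory: (T, 4) one tracked box per frame as (x1, y1, x2, y2),
                given in feature-map coordinates.
    Returns a single (C,)-dimensional tube feature for this trajectory.
    """
    pooled = []
    for t in range(frame_maps.shape[0]):
        # RoI-align the t-th frame's map at the t-th box of the trajectory.
        box = trajectory[t].unsqueeze(0)  # (1, 4)
        feat = ops.roi_align(frame_maps[t:t + 1], [box], output_size=(7, 7))
        pooled.append(feat.mean(dim=(2, 3)))  # (1, C) spatial average
    # Temporal aggregation: a simple mean over frames here; the paper's
    # actual aggregation along the trajectory may differ.
    return torch.cat(pooled, dim=0).mean(dim=0)  # (C,)


# Toy usage: 8 frames, 256 channels, 14x14 maps, one synthetic trajectory.
maps = torch.randn(8, 256, 14, 14)
traj = torch.tensor([[2.0, 2.0, 10.0, 10.0]]).repeat(8, 1)
print(tube_feature(maps, traj).shape)  # torch.Size([256])
```

In a full VidLP pipeline, such tube features would be matched against SRL-derived text features (e.g., the verb and its argument phrases) via a contrastive CM alignment objective; those components are omitted here for brevity.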

Related Material


[bibtex]
@InProceedings{Lee_2024_CVPR,
    author    = {Lee, Ju-Hee and Kang, Je-Won},
    title     = {SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13689-13699}
}