Exploring Temporal Concurrency for Video-Language Representation Learning

Heng Zhang, Daqing Liu, Zezhong Lv, Bing Su, Dacheng Tao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15568-15578


Paired video and language data are naturally temporally concurrent, which requires modeling the temporal dynamics within each modality and the temporal alignment across modalities simultaneously. However, most existing video-language representation learning methods focus only on discrete semantic alignment, which encourages aligned semantics to be close in the latent space, or on temporal context dependency, which captures short-range coherence; neither captures temporal concurrency. In this paper, we propose to learn video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) via a process-wise distance metric learning framework. Specifically, we employ soft Dynamic Time Warping (DTW) to measure the distance between two processes across modalities and then optimize the DTW costs. In addition, we introduce a regularization term that encourages the embeddings of each modality to approximate a stochastic process, preserving their inherent dynamics. Experimental results on three benchmarks demonstrate that TCP stands as a state-of-the-art method for various video-language understanding tasks, including paragraph-to-video retrieval, video moment retrieval, and video question-answering. Code is available at https://github.com/hengRUC/TCP.
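
The soft-DTW distance mentioned above can be illustrated with a minimal sketch. It assumes squared Euclidean costs between clip and sentence embeddings and a smoothing parameter `gamma`; the function names `soft_min` and `soft_dtw`, the cost function, and the default value are illustrative choices, not taken from the TCP code release.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = np.asarray(values, dtype=float) / -gamma
    m = np.max(v)
    if np.isinf(m):  # every candidate is +inf (unreachable cell)
        return np.inf
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))

def soft_dtw(video, text, gamma=0.1):
    """Soft-DTW cost between two embedding sequences.

    video: array of shape (n, d), one row per clip embedding (assumed layout).
    text:  array of shape (m, d), one row per sentence embedding (assumed layout).
    """
    n, m = len(video), len(text)
    # Pairwise squared Euclidean cost between every clip and every sentence.
    cost = ((video[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)
    # R[i, j] is the soft alignment cost of the first i clips and first j sentences.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]
```

Because the smooth minimum is differentiable, a cost of this form can be minimized by gradient descent when re-implemented in an autodiff framework; as `gamma` approaches zero the value approaches the classical DTW cost.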

Related Material

@InProceedings{Zhang_2023_ICCV,
    author    = {Zhang, Heng and Liu, Daqing and Lv, Zezhong and Su, Bing and Tao, Dacheng},
    title     = {Exploring Temporal Concurrency for Video-Language Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {15568-15578}
}