Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning

Heng Zhang, Daqing Liu, Qi Zheng, Bing Su; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2225-2234

Abstract


A meaningful video is semantically coherent and changes smoothly. However, most existing fine-grained video representation learning methods learn frame-wise features by aligning frames across videos or exploring relevance between multiple views, neglecting the inherent dynamic process of each video. In this paper, we propose to learn video representations by modeling Video as Stochastic Processes (VSP) via a novel process-based contrastive learning framework, which aims to discriminate between video processes and simultaneously capture the temporal dynamics in the processes. Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, i.e., a Brownian bridge, in the latent space via a process-based contrastive loss. To construct the Brownian bridge, we adapt specialized sampling strategies under different annotations for both self-supervised and weakly-supervised learning. Experimental results on four datasets show that VSP stands as a state-of-the-art method for various video understanding tasks, including phase progression, phase classification, and frame retrieval. Code is available at https://github.com/hengRUC/VSP.
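
For readers unfamiliar with the Brownian bridge mentioned above, the following is a minimal sketch of its standard marginal distribution: for a bridge pinned at `z_start` at time 0 and at `z_end` at time `T`, the point at time `t` is Gaussian with mean `(1 - t/T) * z_start + (t/T) * z_end` and variance `t * (T - t) / T` per dimension. The function names and the list-based embedding format are illustrative assumptions; this is not the authors' implementation and it does not reproduce the VSP contrastive loss.

```python
import random


def bridge_mean(z_start, z_end, t, T):
    """Mean of a Brownian bridge at time t, pinned at z_start (t=0) and z_end (t=T)."""
    alpha = t / T
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]


def bridge_variance(t, T):
    """Per-dimension variance of the bridge at time t; zero at both endpoints."""
    return t * (T - t) / T


def sample_bridge_point(z_start, z_end, t, T, rng):
    """Draw one latent point from the bridge marginal at time t (hypothetical helper)."""
    mean = bridge_mean(z_start, z_end, t, T)
    std = bridge_variance(t, T) ** 0.5
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    start, end = [0.0, 0.0], [2.0, 4.0]
    for t in (0, 5, 10):
        print(t, bridge_variance(t, 10), sample_bridge_point(start, end, t, 10, rng))
```

The variance vanishes at both endpoints and peaks at the midpoint, which is what makes the process "goal-oriented": intermediate embeddings are free to vary, but the trajectory is tied to fixed start and end points.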

Related Material


@InProceedings{Zhang_2023_CVPR,
    author    = {Zhang, Heng and Liu, Daqing and Zheng, Qi and Su, Bing},
    title     = {Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {2225-2234}
}