BibTeX:
@InProceedings{Xie_2025_CVPR,
  author    = {Xie, Hanchen and Ma, Rose and Zhu, Jiageng and Mai, Zheda and Abd-Almageed, Wael and Abraham, Zubin},
  title     = {Efficiently Mitigating Video Content Misalignment on Large Vision Model with Time-Series Data Alignment},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3301-3307}
}
Efficiently Mitigating Video Content Misalignment on Large Vision Model with Time-Series Data Alignment
Abstract
Video understanding tasks tend to demand considerable computation and storage, particularly as large vision models (LVMs) achieve increasing performance gains from their large model sizes and tremendous amounts of training data. To achieve optimal task performance at minimal cost, LVMs can be trained on only selected critical content. During inference, however, LVMs consume entire lengthy videos without such guidance. This video content misalignment between training and inference can jeopardize LVM performance. Additionally, conventional efficient video understanding methods may be restricted by strong data assumptions that may not hold when leveraging LVMs. Thus, we first provide a preliminary study of the impact of this misalignment challenge. We then introduce a simple and efficient framework that leverages non-visual modalities to align training and inference video content, using time-series data as an example implementation of the framework. Experimental results on an Ego4D-derived dataset demonstrate the framework's promising potential.
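The page gives no implementation details, but as a rough illustration of the idea the abstract describes, the sketch below is entirely hypothetical: the function name `select_critical_windows`, the variance-based activity score, and the IMU example are assumptions, not the paper's method. It selects high-activity windows from a time-series signal synchronized with the video and returns the matching frame ranges, so inference can consume the same kind of critical content the LVM was trained on.

```python
import numpy as np

def select_critical_windows(signal, fps, window_sec=2.0, top_k=3):
    """Score fixed-length windows of a time-series signal aligned with the
    video (sampled at `fps` Hz) and return the frame ranges whose activity
    is highest, as (start_frame, end_frame) tuples.
    """
    win = int(window_sec * fps)            # samples (= frames) per window
    n_windows = len(signal) // win
    # Variance as a simple activity proxy; a stand-in for whatever
    # alignment criterion the paper actually uses.
    scores = [signal[i * win:(i + 1) * win].var() for i in range(n_windows)]
    top = np.argsort(scores)[-top_k:]
    return sorted((int(i) * win, (int(i) + 1) * win) for i in top)

# Usage: feed only these frame ranges to the LVM at inference, mirroring
# the selected critical content seen during training. Hypothetical data:
imu = np.abs(np.random.randn(30 * 60))     # 60 s of IMU magnitude at 30 Hz
print(select_critical_windows(imu, fps=30))
```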