SnAG: Scalable and Accurate Video Grounding

Mu, Fangzhou; Mo, Sicheng; Li, Yin

Fangzhou Mu, Sicheng Mo, Yin Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18930-18940

Abstract

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they have been optimized for grounding only a few text queries within short videos and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
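
The scalability argument for late fusion can be illustrated with a toy cost model. This is a minimal sketch and not the paper's analysis: the function names and the unit costs (`video_cost`, `text_cost`, `fusion_cost`) are hypothetical and chosen only for illustration. It assumes that early fusion re-processes the video for every query, while late fusion encodes the video once and reuses that encoding across all queries.

```python
def early_fusion_cost(num_queries, video_cost, text_cost, fusion_cost):
    """Toy cost when the video is processed jointly with each query.

    Assumption: the video encoder runs once per (video, query) pair.
    """
    return num_queries * (video_cost + text_cost + fusion_cost)


def late_fusion_cost(num_queries, video_cost, text_cost, fusion_cost):
    """Toy cost when the video is encoded once and shared by all queries.

    Assumption: only text encoding and a lightweight fusion step
    are repeated per query.
    """
    return video_cost + num_queries * (text_cost + fusion_cost)


# Hypothetical unit costs: a long video is far more expensive to encode
# than a short text query or a single fusion step.
VIDEO_COST, TEXT_COST, FUSION_COST = 100, 1, 5

for n in (1, 10, 200):
    early = early_fusion_cost(n, VIDEO_COST, TEXT_COST, FUSION_COST)
    late = late_fusion_cost(n, VIDEO_COST, TEXT_COST, FUSION_COST)
    print(f"queries={n:4d}  early={early:6d}  late={late:6d}")
```

Under these made-up costs, the two schemes cost the same for a single query, and the gap grows linearly with the number of queries because the video encoding is amortized under late fusion.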

Related Material

@InProceedings{Mu_2024_CVPR,
    author    = {Mu, Fangzhou and Mo, Sicheng and Li, Yin},
    title     = {SnAG: Scalable and Accurate Video Grounding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18930-18940}
}