Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18198-18208

Abstract


Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g. objects scenes atomic actions). However most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos starting from clip-level captions describing atomic actions then focusing on segment-level descriptions and concluding with generating summaries for hour-long videos. Furthermore we introduce Ego4D-HCap dataset by augmenting Ego4D with 8267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks such as VideoQA on EgoSchema. Data code and models are publicly available at https://sites.google.com/view/vidrecap.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Islam_2024_CVPR, author = {Islam, Md Mohaiminul and Ho, Ngan and Yang, Xitong and Nagarajan, Tushar and Torresani, Lorenzo and Bertasius, Gedas}, title = {Video ReCap: Recursive Captioning of Hour-Long Videos}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {18198-18208} }