Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Chen, Tsai-Shien; Siarohin, Aliaksandr; Menapace, Willi; Deyneka, Ekaterina; Chao, Hsiang-wei; Jeon, Byung Eun; Fang, Yuwei; Lee, Hsin-Ying; Ren, Jian; Yang, Ming-Hsuan; Tulyakov, Sergey

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13320-13331

Abstract

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs high-quality video-text data is much harder to collect. First of all manual labeling is more time-consuming as it requires an annotator to watch an entire video. Second videos have a temporal dimension consist of a number of scenes stacked together and show multiple actions. Accordingly to establish a video dataset with high-quality captions we propose an automatic approach leveraging multimodal inputs such as textual video description subtitles and individual video frames. Specifically we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning video and text retrieval and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Chen_2024_CVPR, author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey}, title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {13320-13331} }