@InProceedings{Qiu_2024_CVPR,
  author    = {Qiu, Jielin and Zhu, Jiacheng and Han, William and Kumar, Aditesh and Mittal, Karthik and Jin, Claire and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Zhao, Ding and Li, Bo and Wang, Lijuan},
  title     = {MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {21909-21921}
}
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
Abstract
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Nonetheless, existing public MSMO datasets suffer from numerous limitations, including insufficient maintenance, data inaccessibility, limited size, and the absence of proper categorization, which pose significant challenges. To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the MMSum dataset. Our new dataset features (1) human-validated summaries for both video and textual content, providing superior human instruction and labels for multimodal learning; (2) comprehensive and meticulously arranged categorization, spanning 17 principal categories and 170 subcategories to encapsulate a diverse array of real-world scenarios; and (3) benchmark tests performed on the proposed dataset to assess various tasks and methods, including video summarization, text summarization, and multimodal summarization. To champion accessibility and collaboration, we release the MMSum dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments.
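To illustrate how a dataset organized this way might be consumed, the sketch below models an MSMO-style record that pairs a video with a human-validated text summary and thumbnail candidates under the 17/170 category hierarchy described above. The field names and record layout here are assumptions for illustration only, not the released MMSum schema:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class MMSumSample:
    # Hypothetical fields; the actual released MMSum schema may differ.
    video_id: str
    category: str        # one of the 17 principal categories
    subcategory: str     # one of the 170 subcategories
    text_summary: str    # human-validated textual summary
    keyframe_times: list # timestamps of thumbnail/keyframe candidates (seconds)

def group_by_category(samples):
    """Group samples by (category, subcategory), e.g. for per-domain evaluation."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.category, s.subcategory)].append(s)
    return groups

# Toy records, not real dataset entries.
samples = [
    MMSumSample("vid001", "Cooking", "Baking", "How to bake bread.", [12.5, 48.0]),
    MMSumSample("vid002", "Cooking", "Grilling", "Grilling steak basics.", [7.0]),
]
groups = group_by_category(samples)
```

Grouping by the category hierarchy in this way would let benchmark scores be broken down per domain, which is one motivation the abstract gives for fine-grained categorization.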