Exploring Group Video Captioning with Efficient Relational Approximation

Wang Lin, Tao Jin, Ye Wang, Wenwen Pan, Linjun Li, Xize Cheng, Zhou Zhao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15281-15290

Abstract


Most current video captioning efforts focus on describing a single video, while the need to caption videos in groups has grown considerably. In this study, we propose a new task, group video captioning, which aims to infer the desired content shared among a group of target videos and describe it against another group of related reference videos. This task requires the model to effectively summarize the target videos and accurately describe their distinguishing content relative to the reference videos, and it becomes more difficult as video length increases. To address this problem: 1) we propose an efficient relational approximation (ERA) to identify the content shared among videos, with complexity linear in the number of videos; 2) we introduce a contextual feature refinery with intra-group self-supervision to capture contextual information and further refine the common properties; and 3) we construct two group video captioning datasets derived from YouCook2 and ActivityNet Captions. Experimental results demonstrate the effectiveness of our method on this new task.
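The abstract's key efficiency claim is that relations among grouped videos can be approximated in time linear in the number of videos, rather than comparing all O(n^2) video pairs. A minimal illustrative sketch of this idea (not the paper's actual ERA module) relates each video's feature vector to a single shared group summary, here simply the component-wise mean; `approximate_group_relations` and its cosine-similarity scoring are assumptions for illustration only:

```python
import math

def approximate_group_relations(video_feats):
    """Linear-time sketch: relate each video to a shared group summary.

    Instead of O(n^2) pairwise comparisons, each of the n video feature
    vectors is compared once against one anchor (the group mean), so the
    number of relation computations grows linearly with n. This is an
    illustrative stand-in for the paper's ERA module, not its method.
    """
    n = len(video_feats)
    d = len(video_feats[0])
    # Shared group summary: component-wise mean of all features, O(n * d).
    anchor = [sum(v[j] for v in video_feats) / n for j in range(d)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)

    # One comparison per video against the anchor: n relations total.
    sims = [cosine(v, anchor) for v in video_feats]
    return anchor, sims
```

Videos whose features score high against the anchor carry the group's common content; low scorers are outliers within the group. The same anchor trick is what makes the cost scale linearly when the group grows.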

Related Material


[bibtex]
@InProceedings{Lin_2023_ICCV,
  author    = {Lin, Wang and Jin, Tao and Wang, Ye and Pan, Wenwen and Li, Linjun and Cheng, Xize and Zhao, Zhou},
  title     = {Exploring Group Video Captioning with Efficient Relational Approximation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {15281-15290}
}