VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

Abstract

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetDiff, a curated dataset designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetDiff reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.

Related Material

[pdf] [supp]

[bibtex]
@InProceedings{Qiu_2025_ICCV,
    author    = {Qiu, Yue and Sun, Yanjun and Yagi, Takuma and Egami, Shusaku and Miyata, Natsuki and Fukuda, Ken and Hara, Kensho and Sagawa, Ryusuke},
    title     = {VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {12242-12252}
}
