Telling Stories for Common Sense Zero-shot Action Recognition

Shreyank N Gowda, Laura Sevilla-Lara; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 4577-4594

Abstract


Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classifiers. Without any target-dataset fine-tuning, our method achieves a new state of the art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain.
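
To make the idea concrete, below is a minimal, hypothetical sketch of how per-class story text could define a semantic space for zero-shot classification. The choice of a SentenceTransformer encoder, the toy story snippets, and the assumption that video features are already projected into the text embedding space are illustrative only; this is not the paper's implementation.

```python
# Hypothetical sketch: build class embeddings from per-class "story" text
# and classify unseen actions by nearest embedding. Encoder choice and the
# visual-to-semantic projection are assumptions, not the authors' method.
import torch
from sentence_transformers import SentenceTransformer

# Toy stand-ins for the multi-sentence Stories narratives of each class.
class_stories = {
    "archery": "Stand sideways to the target. Nock the arrow, draw the bowstring, aim, and release.",
    "kayaking": "Sit in the kayak, grip the paddle with both hands, and alternate strokes on each side.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder
class_names = list(class_stories.keys())

# Encode each class story into a single semantic embedding and normalize.
class_emb = torch.tensor(encoder.encode([class_stories[c] for c in class_names]))
class_emb = torch.nn.functional.normalize(class_emb, dim=-1)

def classify(video_feat: torch.Tensor) -> str:
    """Return the unseen-class label whose story embedding is closest.

    `video_feat` is assumed to already lie in the same embedding space,
    e.g. via a learned visual-to-semantic mapping (hypothetical here).
    """
    video_feat = torch.nn.functional.normalize(video_feat, dim=-1)
    scores = video_feat @ class_emb.T  # cosine similarity against all classes
    return class_names[scores.argmax().item()]
```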

Related Material


@InProceedings{Gowda_2024_ACCV,
    author    = {Gowda, Shreyank N and Sevilla-Lara, Laura},
    title     = {Telling Stories for Common Sense Zero-shot Action Recognition},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {4577-4594}
}