IntentQA: Context-aware Video Intent Reasoning

Jiapeng Li, Ping Wei, Wenjuan Han, Lifeng Fan; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11963-11974

Abstract


In this paper, we propose a novel task, IntentQA, a special VideoQA task focusing on video intent reasoning, which has become increasingly important because it equips AI agents with the capability to reason beyond mere recognition in daily tasks. We also contribute a large-scale VideoQA dataset for this task. We propose a Context-aware Video Intent Reasoning model (CaVIR) consisting of i) a Video Query Language (VQL) for better cross-modal representation of the situational context, ii) a Contrastive Learning module for utilizing the contrastive context, and iii) a Commonsense Reasoning module for incorporating the commonsense context. Comprehensive experiments on this challenging task demonstrate the effectiveness of each model component, the superiority of our full model over other baselines, and the generalizability of our model to a new VideoQA task. The dataset and code are open-sourced at: https://github.com/JoseponLee/IntentQA.git

Related Material


@InProceedings{Li_2023_ICCV,
    author    = {Li, Jiapeng and Wei, Ping and Han, Wenjuan and Fan, Lifeng},
    title     = {IntentQA: Context-aware Video Intent Reasoning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {11963-11974}
}