Listen Then See: Video Alignment with Speaker Attention

Aviral Agrawal, Carlos Mateo Samudio Lezcano, Iqui Balam Heredia-Marin, Prabhdeep Singh Sethi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2018-2027

Abstract


Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information; in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (81.1% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge to the language modality. This leads to enhanced performance by reducing the prevalent issues of language overfitting and the resulting bypassing of the video modality encountered by existing techniques.

Related Material


[bibtex]
@InProceedings{Agrawal_2024_CVPR,
    author    = {Agrawal, Aviral and Lezcano, Carlos Mateo Samudio and Heredia-Marin, Iqui Balam and Sethi, Prabhdeep Singh},
    title     = {Listen Then See: Video Alignment with Speaker Attention},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2018-2027}
}