Pre-trained Bidirectional Dynamic Memory Network For Long Video Question Answering

Jinmeng Wu, Pengcheng Shu, Hanyu Hong, Lei Ma, Ying Zhu, Lei Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5550-5557

Abstract


Video question answering is an important problem that has received extensive research interest in the past few years. Although visual language models have recently succeeded in several multimodal tasks, they still face challenges in complex reasoning over long movies that involve multiple person-object interaction events. Humans tackle these issues by using a sequence of episodic memories as reference points to swiftly identify the crucial moments that are pertinent to the task at hand. To emulate this successful method of reasoning, we propose a pre-trained bidirectional dynamic memory model for long video question answering. In the feature extraction stage, we use a pre-trained CLIP model to generate video features and question features. We then use a bidirectional matching memory network to acquire event memories. In the feature fusion stage, the event memories are used as connecting points to establish associations between the high-level event concepts and the low-level redundant video content using spatio-temporal self-attention. When answering a question, our model first prioritizes the generated key event memories and then considers the most relevant event memories. Our contribution addresses two difficulties in long video question answering: effectively matching video features and question features in complex video contexts, and the tendency for features to be lost over long temporal video sequences. Experiments on two public video question answering datasets show that our model performs well on long video question answering.
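
The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no equations, so the gating rule, the top-k selection, and every name below (`gated_memory_pass`, `bidirectional_event_memories`, `top_k`) are assumptions made for illustration. Random vectors stand in for the CLIP frame and question embeddings, and the spatio-temporal self-attention fusion stage is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def gated_memory_pass(frames, question, reverse=False):
    """One directional pass over time (assumed update rule, not from the paper).

    Each frame is blended into a running memory with a sigmoid gate on its
    similarity to the question, so question-relevant frames overwrite more.
    """
    num_frames = len(frames)
    order = range(num_frames - 1, -1, -1) if reverse else range(num_frames)
    memory = np.zeros_like(question)
    states = np.zeros_like(frames)
    for t in order:
        gate = 1.0 / (1.0 + np.exp(-(frames[t] @ question)))
        memory = gate * frames[t] + (1.0 - gate) * memory
        states[t] = memory
    return states

def bidirectional_event_memories(frames, question, top_k=3):
    """Combine forward and backward passes, then keep the top_k time steps."""
    frames = l2_normalize(frames)
    question = l2_normalize(question)
    forward = gated_memory_pass(frames, question)
    backward = gated_memory_pass(frames, question, reverse=True)
    states = np.concatenate([forward, backward], axis=-1)  # shape (T, 2D)
    relevance = frames @ question                          # shape (T,)
    # Pick the most question-relevant steps, returned in temporal order.
    top_idx = np.sort(np.argsort(-relevance)[:top_k])
    return states[top_idx], top_idx

# Stand-in inputs: 16 frames and one question, each an 8-dimensional vector.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))
question = rng.normal(size=8)
memories, idx = bidirectional_event_memories(frames, question, top_k=3)
print(memories.shape, idx)
```

In this sketch the selected rows of `memories` play the role of the key event memories that a later fusion stage would attend over together with the full frame sequence.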

Related Material


[bibtex]
@InProceedings{Wu_2024_CVPR,
    author    = {Wu, Jinmeng and Shu, Pengcheng and Hong, Hanyu and Ma, Lei and Zhu, Ying and Wang, Lei},
    title     = {Pre-trained Bidirectional Dynamic Memory Network For Long Video Question Answering},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {5550-5557}
}