Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Difei Gao, Ruiping Wang, Ziyi Bai, Xilin Chen; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1675-1685

Abstract

Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track the state changes of objects, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting with the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges Env-QA poses, such as long-term state tracking, multi-event temporal reasoning, and event counting.
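The abstract describes TSEA only at a high level. As a purely illustrative aid, the minimal PyTorch sketch below shows one plausible form of question-guided attention over event-level video features: frames are mean-pooled within predicted event segments, and a question embedding scores each event. Every name, shape, and design choice here is an assumption for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EventAttention(nn.Module):
    """Hypothetical event-level attention: pools frame features into
    event features, then attends over events conditioned on a question
    embedding. A sketch of the idea, not TSEA itself."""

    def __init__(self, feat_dim: int, q_dim: int):
        super().__init__()
        # Scores each (event feature, question) pair; name is an assumption.
        self.score = nn.Linear(feat_dim + q_dim, 1)

    def forward(self, frame_feats, event_bounds, q_emb):
        # frame_feats: (T, feat_dim) per-frame features
        # event_bounds: list of (start, end) frame indices, one per event
        # q_emb: (q_dim,) question embedding
        # Mean-pool the frames inside each predicted event segment.
        event_feats = torch.stack(
            [frame_feats[s:e].mean(dim=0) for s, e in event_bounds]
        )  # (E, feat_dim)
        # Concatenate the question with each event feature and score it.
        q = q_emb.unsqueeze(0).expand(event_feats.size(0), -1)
        attn = F.softmax(self.score(torch.cat([event_feats, q], dim=-1)), dim=0)
        # Question-aware summary of the video as a weighted sum of events.
        return (attn * event_feats).sum(dim=0)  # (feat_dim,)

Such a summary vector would then typically be fused with the question representation and fed to an answer classifier; how TSEA actually segments events and combines attention outputs is detailed in the paper, not here.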

Related Material

[bibtex]
@InProceedings{Gao_2021_ICCV,
    author    = {Gao, Difei and Wang, Ruiping and Bai, Ziyi and Chen, Xilin},
    title     = {Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1675-1685}
}