Explainable Video Entailment With Grounded Visual Evidence

Junwen Chen, Yu Kong; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2021-2030

Abstract


Video entailment aims to determine whether a textual hypothesis statement is entailed or contradicted by a premise video. The main challenge is that it requires fine-grained reasoning over long, complex, story-based videos. To this end, we propose to incorporate visual grounding into entailment by explicitly linking the entities described in the statement to the evidence in the video. If the entities are grounded in the video, we enhance the entailment judgment by focusing on the frames where the entities occur. In addition, in the entailment dataset, real and fake statements are formed in pairs with subtle discrepancies, which allows an add-on explanation module to predict which words or phrases make the statement contradictory to the video and to regularize the training of the entailment judgment. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods.

Related Material


@InProceedings{Chen_2021_ICCV,
    author    = {Chen, Junwen and Kong, Yu},
    title     = {Explainable Video Entailment With Grounded Visual Evidence},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2021-2030}
}