Structured Multi-Level Interaction Network for Video Moment Localization via Language Query
We address the problem of localizing a specific moment described by a natural language query. Existing works interact the query with either video frame or moment proposal, and neglect the inherent structure of moment construction for both cross-modal understanding and video content comprehension, which are the two crucial challenges for this task. In this paper, we disentangle the activity moment into boundary and content. Based on the explored moment structure, we propose a novel Structured Multi-level Interaction Network (SMIN) to tackle this problem through multi-levels of cross-modal interaction coupled with content-boundary-moment interaction. In particular, for cross-modal interaction, we interact the sentence-level query with the whole moment while interact the word-level query with content and boundary, as in a coarse-to-fine manner. For content-boundary-moment interaction, we capture the insightful relations between boundary, content, and the whole moment proposal. Through multi-level interactions, the model obtains robust cross-modal representation for accurate moment localization. Extensive experiments conducted on three benchmarks (i.e., Charades-STA, ActivityNet-Captions, and TACoS) demonstrate the proposed approach outperforms the state-of-the-art methods.