BMRN: Boundary Matching and Refinement Network for Temporal Moment Localization With Natural Language
Temporal moment localization (TML) aims to retrieve the moment in a video that best matches a given sentence query. This task is challenging because it requires understanding the semantic content of both the video and the sentence, as well as the relationship between them. TML methods using 2D temporal maps, which represent the features or scores of all moment proposals with start and end times indexed along the two axes, have improved performance by modeling moment proposals in relation to one another. These methods, however, are limited by the coarsely pre-defined fixed boundaries of target moments, which depend on the length of the training videos and the amount of available memory. To overcome this limitation, we propose a boundary matching and refinement network (BMRN) that generates 2D boundary matching and refinement maps along with a proposal feature map to obtain the final proposal score map. Our BMRN adjusts the fixed boundaries of moment proposals with center and length offsets predicted from the boundary refinement maps. In addition, we introduce a length-aware proposal feature map that combines a cross-modal feature map with a similarity map between the predicted duration of the target moment and the moment proposals. Our approach improves TML performance on the Charades-STA and ActivityNet Captions datasets, outperforming state-of-the-art methods by a large margin.
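To make the two core ideas of the abstract concrete, the sketch below shows (a) a 2D temporal map whose two axes index the start and end clips of each moment proposal, and (b) boundary refinement that shifts a proposal's fixed boundaries by predicted center and length offsets. This is an illustrative sketch, not the authors' implementation: the function names and the offset parameterization (center offset as a fraction of proposal length, log-scale length offset) are assumptions.

```python
import numpy as np

def proposal_boundaries(num_clips, clip_len=1.0):
    """Enumerate all (start, end) pairs on a num_clips x num_clips 2D map.

    Entry (i, j) is a valid moment proposal only when i <= j, i.e. the
    upper triangle of the map; the remaining entries are masked out.
    """
    i, j = np.meshgrid(np.arange(num_clips), np.arange(num_clips),
                       indexing="ij")
    valid = i <= j                      # start must not come after end
    starts = i * clip_len               # proposal start times (seconds)
    ends = (j + 1) * clip_len           # proposal end times (seconds)
    return starts, ends, valid

def refine_boundaries(starts, ends, d_center, d_length):
    """Adjust fixed proposal boundaries with predicted offsets.

    Hypothetical parameterization: d_center shifts the proposal center
    by a fraction of its length; d_length rescales the length on a log
    scale. In BMRN these offsets would come from the refinement maps.
    """
    centers = (starts + ends) / 2.0
    lengths = ends - starts
    new_centers = centers + d_center * lengths
    new_lengths = lengths * np.exp(d_length)
    return new_centers - new_lengths / 2.0, new_centers + new_lengths / 2.0
```

With zero offsets the refined boundaries coincide with the fixed grid boundaries; nonzero offsets let the network place moment boundaries between the coarse pre-defined positions, which is the limitation the abstract targets.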