Dynamic Scene Graph Generation via Anticipatory Pre-Training

Yiming Li, Xiaoshan Yang, Changsheng Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 13874-13883

Abstract


Humans not only perceive the objects in a visual scene but also identify the relationships between them. Each visual relationship can be abstracted as a semantic triple <subject, predicate, object>, and the set of such triples forms a scene graph, which conveys rich information for visual understanding. Because objects move, the visual relationship between two objects in a video may change over time, which makes dynamically generating scene graphs from videos more complicated and challenging than conventional image-based static scene graph generation. Inspired by the human ability to anticipate visual relationships, we propose a novel Transformer-based anticipatory pre-training paradigm that explicitly models the temporal correlation of visual relationships across frames to improve dynamic scene graph generation. In the pre-training stage, the model predicts the visual relationships of the current frame from the previous frames, extracting intra-frame spatial information with a spatial encoder and inter-frame temporal correlations with a temporal encoder. In the fine-tuning stage, we reuse the spatial encoder and the temporal decoder and combine them with the information of the current frame to predict the visual relationships. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Action Genome dataset.
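
The sketch below is a minimal illustration (not the authors' released code) of the anticipatory idea described above: a spatial encoder contextualizes relationship candidates within each frame, a causally masked temporal module aggregates earlier frames, and a head predicts the predicates of the next frame. All names, tensor sizes, and the use of a plain PyTorch TransformerEncoder for the temporal part are illustrative assumptions.

# Minimal anticipatory pre-training sketch (assumptions noted above).
import torch
import torch.nn as nn

class AnticipatorySketch(nn.Module):
    def __init__(self, feat_dim=256, num_predicates=26, num_layers=2, num_heads=8):
        # num_predicates is a placeholder; set it to the dataset's predicate vocabulary size.
        super().__init__()
        # Intra-frame spatial context over candidate subject-object pairs.
        spatial_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers)
        # Inter-frame temporal context; a causal mask keeps frame t from seeing future frames.
        temporal_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers)
        self.predicate_head = nn.Linear(feat_dim, num_predicates)

    def forward(self, frames):
        # frames: (T, N, D) = T frames, N relationship candidates per frame, D-dim features.
        T, N, D = frames.shape
        spatial = self.spatial_encoder(frames)            # (T, N, D) intra-frame reasoning
        summaries = spatial.mean(dim=1).unsqueeze(0)      # (1, T, D) one summary token per frame
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        temporal = self.temporal_encoder(summaries, mask=causal)  # (1, T, D)
        # The state at frame t (built from frames <= t) anticipates the predicates of frame t+1.
        return self.predicate_head(temporal[:, :-1, :])   # (1, T-1, num_predicates)

model = AnticipatorySketch()
video = torch.randn(8, 5, 256)     # 8 frames, 5 candidate pairs each, 256-dim features
print(model(video).shape)          # torch.Size([1, 7, 26])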

Related Material


[pdf]
[bibtex]
@InProceedings{Li_2022_CVPR,
    author    = {Li, Yiming and Yang, Xiaoshan and Xu, Changsheng},
    title     = {Dynamic Scene Graph Generation via Anticipatory Pre-Training},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {13874-13883}
}