Context-Aware Scene Graph Generation With Seq2Seq Transformers

Yichao Lu, Himanshu Rai, Jason Chang, Boris Knyazev, Guangwei Yu, Shashank Shekhar, Graham W. Taylor, Maksims Volkovs; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15931-15941

Abstract


Scene graph generation is an important task in computer vision aimed at improving the semantic understanding of the visual world. In this task, the model needs to detect objects and predict visual relationships between them. Most existing models predict relationships in parallel, assuming their independence. While there are different ways to capture the dependencies between relationships, we explore a conditional approach motivated by the sequence-to-sequence (Seq2Seq) formalism. Unlike previous research, our proposed model predicts visual relationships one at a time in an autoregressive manner by explicitly conditioning on the already predicted relationships. Drawing from translation models in NLP, we propose an encoder-decoder model built using Transformers where the encoder captures global context and long-range interactions. The decoder then makes sequential predictions by conditioning on the scene graph constructed so far. In addition, we introduce a novel reinforcement learning-based training strategy tailored to Seq2Seq scene graph generation. By using a self-critical policy gradient training approach with Monte Carlo search, we directly optimize for the (mean) recall metrics and bridge the gap between training and evaluation. Experimental results on two public benchmark datasets demonstrate that our Seq2Seq learning approach achieves strong empirical performance, outperforming previous state-of-the-art, while remaining efficient in terms of training and inference time. Full code for this work is available here: https://github.com/layer6ai-labs/SGG-Seq2Seq.
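
To make the self-critical policy gradient idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common self-critical setup in which the reward of a sampled relationship sequence is compared against the reward of a greedy decode used as the baseline, with recall as the reward. The Monte Carlo search component mentioned in the abstract is omitted, and the names `recall` and `self_critical_loss` are hypothetical.

```python
from typing import List, Set, Tuple

# A relationship is represented here as a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


def recall(predicted: List[Triple], ground_truth: Set[Triple]) -> float:
    """Fraction of ground-truth triples that appear among the predictions."""
    if not ground_truth:
        return 0.0
    return len(set(predicted) & ground_truth) / len(ground_truth)


def self_critical_loss(
    sampled: List[Triple],
    sampled_log_probs: List[float],
    greedy: List[Triple],
    ground_truth: Set[Triple],
) -> float:
    """REINFORCE-style loss with a greedy-decode baseline (an assumption).

    The advantage is positive when the sampled sequence beats the greedy
    one, so minimizing the loss raises the probability of that sample.
    """
    advantage = recall(sampled, ground_truth) - recall(greedy, ground_truth)
    return -advantage * sum(sampled_log_probs)
```

In an actual training loop the log-probabilities would come from the decoder and the loss would be backpropagated by an autodiff framework; plain floats are used here only to keep the sketch self-contained.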

Related Material


[bibtex]
@InProceedings{Lu_2021_ICCV,
    author    = {Lu, Yichao and Rai, Himanshu and Chang, Jason and Knyazev, Boris and Yu, Guangwei and Shekhar, Shashank and Taylor, Graham W. and Volkovs, Maksims},
    title     = {Context-Aware Scene Graph Generation With Seq2Seq Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {15931-15941}
}