[pdf]
[supp]
[bibtex]
@InProceedings{Ozsoy_2025_CVPR,
    author    = {{\"O}zsoy, Ege and Holm, Felix and Pellegrini, Chantal and Czempiel, Tobias and Saleh, Mahdi and Navab, Nassir and Busam, Benjamin},
    title     = {Location-Free Scene Graph Generation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {108-117}
}
Location-Free Scene Graph Generation
Abstract
Scene Graph Generation (SGG) is a visual understanding task that describes a scene as a graph of entities and their relationships, traditionally relying on spatial labels such as bounding boxes or segmentation masks. These requirements increase annotation costs and complicate integration with other modalities where spatial synchronization may be unavailable. In this work, we investigate the feasibility and effectiveness of scene graphs without location information, offering an alternative paradigm for scenarios where spatial data is unavailable. To this end, we propose the first method to generate location-free scene graphs directly from images, evaluate their correctness, and show the usefulness of such location-free scene graphs in several downstream tasks. Our proposed method, Pix2SG, models scene graph generation as an autoregressive sequence modeling task, predicting all instances and their relations as one output sequence. To enable evaluation without location matching, we propose a heuristic tree search algorithm that matches predicted scene graphs with ground-truth graphs, bypassing the need for location-based metrics. We demonstrate the effectiveness of location-free scene graphs on three benchmark datasets and two downstream tasks -- image retrieval and visual question answering -- showing that they can achieve competitive performance with significantly fewer annotations. Our findings suggest that scene graphs can be generated and utilized effectively without location information, opening new avenues for scalable, structured, and efficient visual representations, for example in multimodal scene understanding, by reducing dependency on modality-specific annotations. The code will be made available upon acceptance.
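The abstract describes serializing a scene graph into a single output sequence for autoregressive prediction. Below is a minimal, hypothetical sketch of what such a location-free serialization could look like; the quintuple layout (subject class, subject instance id, predicate, object class, object instance id) and all function names are illustrative assumptions made for this sketch, not the authors' published format or code.

# Minimal sketch (not the authors' code): flatten a location-free scene graph
# into one token sequence, as an autoregressive model such as Pix2SG might
# predict it, and parse the sequence back into relation quintuples.
# The quintuple layout is an assumption based on the abstract.

from typing import List, Tuple

# One relation over class-level entities with instance indices,
# e.g. ("person", 0, "riding", "horse", 0).
Quintuple = Tuple[str, int, str, str, int]

def graph_to_sequence(relations: List[Quintuple]) -> List[str]:
    """Serialize all relations into a single flat token sequence."""
    tokens: List[str] = []
    for subj, subj_id, pred, obj, obj_id in relations:
        tokens.extend([subj, str(subj_id), pred, obj, str(obj_id)])
    tokens.append("<eos>")  # end-of-sequence marker
    return tokens

def sequence_to_graph(tokens: List[str]) -> List[Quintuple]:
    """Parse a flat token sequence back into relation quintuples."""
    body = tokens[: tokens.index("<eos>")] if "<eos>" in tokens else tokens
    relations: List[Quintuple] = []
    for i in range(0, len(body) - 4, 5):
        subj, subj_id, pred, obj, obj_id = body[i : i + 5]
        relations.append((subj, int(subj_id), pred, obj, int(obj_id)))
    return relations

if __name__ == "__main__":
    graph = [("person", 0, "riding", "horse", 0),
             ("person", 0, "wearing", "hat", 0)]
    seq = graph_to_sequence(graph)
    print(seq)
    assert sequence_to_graph(seq) == graph

Note that in this formulation instance identity is carried by an integer index rather than a spatial location, which is what removes the need for bounding boxes or masks and motivates the location-free, heuristic tree-search matching used for evaluation in the paper.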