@InProceedings{Roy_2025_WACV,
  author    = {Roy, Anirban and Cobb, Adam and Kaur, Ramneet and Jha, Sumit and Bastian, Nathaniel and Berenbeim, Alexander and Thomson, Robert and Cruickshank, Iain and Velasquez, Alvaro and Jha, Susmit},
  title     = {Zero-Shot Detection of Out-of-Context Objects using Foundation Models},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {9168-9177}
}
Zero-Shot Detection of Out-of-Context Objects using Foundation Models
Abstract
We address the problem of detecting out-of-context (OOC) objects in a scene. Given an image, we aim to detect whether it contains objects that are not present in their usual context and to localize such OOC objects. Existing approaches for OOC detection rely on defining the common context in terms of manually constructed features, such as the co-occurrence of objects, spatial relations between objects, and the shape and size of objects, and then learning that context for a given dataset. But context is often nuanced, ranging from very common to very surprising. Further, context learned from a specific dataset may not generalize, as datasets may not truly represent the human notion of what is in context. Motivated by the success of large language models, and more generally foundation models (FMs), in common-sense reasoning, we investigate the FM's ability to capture a more generalized notion of context. We find that a pre-trained FM such as GPT-4 provides a more nuanced notion of OOC and enables zero-shot OOC detection when coupled with other pre-trained FMs for caption generation, such as BLIP-2, and for image in-painting, such as Stable Diffusion 2.0. Our approach does not need any dataset-specific training. We demonstrate the efficacy of our approach on two OOC object detection datasets, achieving 90.8% zero-shot accuracy on the MIT-OOC dataset and 87.26% on the IJCAI22-COCO-OOC dataset.
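The abstract describes a zero-shot pipeline that chains pre-trained FMs: a captioning model (e.g. BLIP-2) summarizes the scene, and a language model (e.g. GPT-4) judges how surprising each detected object is in that context. The sketch below is a minimal, hypothetical rendering of that orchestration only; the FM calls are passed in as stub callables, and the function names, prompts, and threshold are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the zero-shot OOC detection loop described above.
# `caption_fn` and `context_score_fn` stand in for real FM calls
# (e.g. BLIP-2 captioning and a GPT-4 "how surprising is this object
# in this scene?" query); both are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def detect_ooc_objects(
    detections: List[Detection],
    caption_fn: Callable[[List[Detection]], str],
    context_score_fn: Callable[[str, str], float],  # (caption, label) -> surprise in [0, 1]
    threshold: float = 0.5,  # illustrative cutoff, not from the paper
) -> List[Detection]:
    """Flag detections the language model rates as surprising for the scene."""
    caption = caption_fn(detections)
    return [d for d in detections if context_score_fn(caption, d.label) > threshold]
```

Because each detection is scored independently against the scene caption, the same loop both detects whether the image contains an OOC object (any flagged detection) and localizes it (the flagged detection's box).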