Enhancing Embodied Object Detection with Spatial Feature Memory

Nicolas Harvey Chapman, Christopher Lehnert, Will Browne, Feras Dayoub; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6921-6931

Abstract


Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, existing object detection paradigms are not optimally tailored to the embodied conditions inherent in robotics, where the same objects are repeatedly observed over time. In this setting, detectors that operate on single images or short sequences are likely to produce inconsistent predictions. Motivated by this, we explore whether the embodiment of the detector can be utilised to generate more consistent and reliable detections during repeat observation of a scene. We propose a novel framework that incrementally updates a spatial feature memory while using it as a prior to perform image object detection. By leveraging the embodiment of the robot in this way, raw object detection performance is enhanced by up to 4.12 mAP, and downstream robotic tasks such as semantic mapping and object recall are improved. We also investigate the structure this spatial memory should take, leading to an implementation that aggregates features from the shared language-image embedding space. This approach allows the detector to effectively balance the use of memory and image features while ensuring that the benefits of language-image pre-training can be enjoyed alongside our spatial memory.

Related Material


@InProceedings{Chapman_2025_WACV,
    author    = {Chapman, Nicolas Harvey and Lehnert, Christopher and Browne, Will and Dayoub, Feras},
    title     = {Enhancing Embodied Object Detection with Spatial Feature Memory},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6921-6931}
}