NeuroViG - Integrating Event Cameras for Resource-Efficient Video Grounding

Dulanga Weerakoon, Vigneshwaran Subbaraju, Joo Hwee Lim, Archan Misra; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5781-5790

Abstract

Spatio-Temporal Video Grounding (STVG) - the task of identifying the target object in the field-of-view that a language instruction refers to - is a fundamental vision-language task. Current STVG approaches typically utilize feeds from an RGB camera that is assumed to be always-on and process the video frames using complex neural network pipelines. As a result, they often impose prohibitive system overheads (energy, latency) on pervasive devices. To address this, we propose NeuroViG with two key innovations: (a) leveraging event streams from a low-power neuromorphic event camera sensor to selectively trigger the more energy-hungry RGB camera for STVG, and (b) augmenting the STVG model with a lightweight Adaptive Frame Selector (AFS) that bypasses complex transformer-based operations for the majority of video frames, thereby enabling execution on a pervasive Jetson AGX device. We also introduce modifications to the neural network processing pipeline so that the system offers tunable tradeoffs between accuracy and energy/latency. Our proposed NeuroViG system reduces the STVG energy overhead and latency by 4x and 3.8x, respectively, with less than 1% loss in accuracy.

Related Material

[pdf]
[bibtex]
@InProceedings{Weerakoon_2025_WACV,
    author    = {Weerakoon, Dulanga and Subbaraju, Vigneshwaran and Lim, Joo Hwee and Misra, Archan},
    title     = {NeuroViG - Integrating Event Cameras for Resource-Efficient Video Grounding},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5781-5790}
}