@InProceedings{Patsch_2025_ICCV,
    author    = {Patsch, Constantin and Goter, Jaden and Greer, Joseph and Ma, Lingni and Sodhi, Raj},
    title     = {WACU: Multi-Modal Wristband Assistant for Contextual Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7273-7282}
}
WACU: Multi-Modal Wristband Assistant for Contextual Understanding
Abstract
Recent advances in language modeling and multimodal learning have significantly enhanced the capabilities of perception systems. While many existing approaches emphasize egocentric setups that use smart glasses to perceive human actions, the emergence and adoption of wearable devices such as smartwatches and fitness trackers enable new ways of understanding interactions from a wrist-worn perspective. In this work, we rely on our wristband prototype, which captures video, audio, and IMU data from a wrist perspective. Traditional single-modality systems often face challenges such as missing modalities or interference during inference, which can degrade the accuracy of interaction context capture. To address these issues, we introduce a multimodal Large Language Model (LLM) based system that integrates video, audio, and IMU data and is designed to generate comprehensive multimodal captions that describe human interactions more robustly than approaches relying on individual modalities alone. Evaluation on public datasets and our user study demonstrate the multimodal contextual understanding capabilities of our approach and underscore the importance of sensor fusion for accurate hand-centric interaction captioning, especially when modalities are missing.
