Interaction Region Visual Transformer for Egocentric Action Anticipation

Debaditya Roy, Ramanathan Rajendiran, Basura Fernando; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 6740-6750

Abstract


Human-object interaction (HOI) and the temporal dynamics along motion paths are the most important visual cues for egocentric action anticipation. In particular, interaction regions covering objects and the human hand provide significant visual cues for predicting future human actions. However, how to incorporate and capture these important visual cues in modern video Transformer architectures remains a challenge, especially because integrating inductive biases into Transformers is hard. We leverage the effective MotionFormer, which models motion dynamics, to incorporate interaction regions using spatial cross-attention, and further infuse contextual information using trajectory cross-attention to obtain an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. On the EK100 evaluation server, InAViT is at the top of the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall. We will release the code.
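The spatial cross-attention described above lets interaction-region tokens (e.g., hand and object crops) query the frame's patch tokens. As a rough illustration only, the following sketches generic scaled dot-product cross-attention in NumPy; the token counts, dimensions, and variable names are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries attend to keys/values.

    queries: (Nq, d), keys/values: (Nk, d) -> output: (Nq, d)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Nq, Nk) similarity
    weights = softmax(scores, axis=-1)       # each query's attention distribution
    return weights @ values                  # context-refined query tokens

# Hypothetical setup: 4 interaction-region tokens attend to
# 16 spatial patch tokens from the same frame (d = 32).
rng = np.random.default_rng(0)
region_tokens = rng.standard_normal((4, 32))
patch_tokens = rng.standard_normal((16, 32))
refined = cross_attention(region_tokens, patch_tokens, patch_tokens)
print(refined.shape)  # (4, 32)
```

In the actual model, such refined region tokens would additionally pass through trajectory cross-attention across frames; this sketch shows only the single-frame spatial step.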

Related Material


[bibtex]
@InProceedings{Roy_2024_WACV,
  author    = {Roy, Debaditya and Rajendiran, Ramanathan and Fernando, Basura},
  title     = {Interaction Region Visual Transformer for Egocentric Action Anticipation},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {January},
  year      = {2024},
  pages     = {6740-6750}
}