@InProceedings{Parmar_2025_WACV,
  author    = {Parmar, Paritosh and Peh, Eric and Fernando, Basura},
  title     = {Learning to Visually Connect Actions and their Effects},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {1477-1487}
}
Learning to Visually Connect Actions and their Effects
Abstract
We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection (AS) and Effect-Affinity Assessment (EAA), where video understanding models connect actions and effects at the semantic and fine-grained levels, respectively. We design various baseline models for AS and EAA. Despite the intuitive nature of the task, we observe that models struggle, and humans outperform them by a large margin. Our experiments show that, in solving AS and EAA, models learn intuitive properties like object tracking and encoding pose-related features without explicit supervision. We demonstrate that CATE can be an effective self-supervised task for learning video representations from unlabeled videos. The study aims to showcase the fundamental nature and versatility of CATE, with the hope of inspiring advanced formulations and models.