A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives

Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Giuseppe Averta; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18275-18285

Abstract


Human comprehension of a video stream is naturally broad: in a few instants we are able to understand what is happening the relevance and relationship of objects and forecast what will follow in the near future everything all at once. We believe that - to effectively transfer such an holistic perception to intelligent machines - an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks to synergistically exploit them when learning novel skills. To accomplish this we look for a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights as a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks outperforming current state-of-the-art methods. Project webpage: https://sapeirone.github.io/EgoPack.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Peirone_2024_CVPR, author = {Peirone, Simone Alberto and Pistilli, Francesca and Alliegro, Antonio and Averta, Giuseppe}, title = {A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {18275-18285} }