Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C.V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 19383-19400

Abstract


We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.
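
To make the dataset's structure concrete, the sketch below models one Ego-Exo4D capture as a record bundling the synchronized modalities the abstract lists (ego/exo video, multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and paired language descriptions). The schema, field names, file layout, and `load_take` helper are hypothetical illustrations of how such a record might be organized, not the released Ego-Exo4D API or data format.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class Take:
    """One synchronized capture of a skilled activity (hypothetical schema).

    Mirrors the modalities described in the abstract: egocentric and
    exocentric video, multichannel audio, eye gaze, 3D point clouds,
    camera poses, IMU, and paired language such as expert commentary.
    """
    take_id: str
    scenario: str                      # e.g. "bike repair", "dance"
    duration_s: float                  # long-form: roughly 1 to 42 minutes
    ego_video: Path                    # first-person (egocentric) stream
    exo_videos: list[Path]             # third-person (exocentric) streams
    audio: Optional[Path] = None       # multichannel audio
    gaze: Optional[Path] = None        # eye-gaze track
    point_cloud: Optional[Path] = None # 3D point cloud of the scene
    camera_poses: Optional[Path] = None
    imu: Optional[Path] = None
    commentary: list[str] = field(default_factory=list)  # expert commentary

def load_take(root: Path, take_id: str) -> Take:
    """Assemble a Take from a hypothetical per-take metadata.json file."""
    take_dir = root / take_id
    meta = json.loads((take_dir / "metadata.json").read_text())
    return Take(
        take_id=take_id,
        scenario=meta["scenario"],
        duration_s=meta["duration_s"],
        ego_video=take_dir / meta["ego_video"],
        exo_videos=[take_dir / p for p in meta["exo_videos"]],
        commentary=meta.get("commentary", []),
    )
```

A record like this makes the benchmark tasks easy to state: cross-view translation maps between `ego_video` and `exo_videos`, proficiency estimation scores the performance captured in a take, and the `commentary` field supplies the expert language supervision.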

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Grauman_2024_CVPR,
    author    = {Grauman, Kristen and Westbury, Andrew and Torresani, Lorenzo and Kitani, Kris and Malik, Jitendra and Afouras, Triantafyllos and Ashutosh, Kumar and Baiyya, Vijay and Bansal, Siddhant and Boote, Bikram and Byrne, Eugene and Chavis, Zach and Chen, Joya and Cheng, Feng and Chu, Fu-Jen and Crane, Sean and Dasgupta, Avijit and Dong, Jing and Escobar, Maria and Forigua, Cristhian and Gebreselasie, Abrham and Haresh, Sanjay and Huang, Jing and Islam, Md Mohaiminul and Jain, Suyog and Khirodkar, Rawal and Kukreja, Devansh and Liang, Kevin J and Liu, Jia-Wei and Majumder, Sagnik and Mao, Yongsen and Martin, Miguel and Mavroudi, Effrosyni and Nagarajan, Tushar and Ragusa, Francesco and Ramakrishnan, Santhosh Kumar and Seminara, Luigi and Somayazulu, Arjun and Song, Yale and Su, Shan and Xue, Zihui and Zhang, Edward and Zhang, Jinxu and Castillo, Angela and Chen, Changan and Fu, Xinzhu and Furuta, Ryosuke and Gonzalez, Cristina and Gupta, Prince and Hu, Jiabo and Huang, Yifei and Huang, Yiming and Khoo, Weslie and Kumar, Anush and Kuo, Robert and Lakhavani, Sach and Liu, Miao and Luo, Mi and Luo, Zhengyi and Meredith, Brighid and Miller, Austin and Oguntola, Oluwatumininu and Pan, Xiaqing and Peng, Penny and Pramanick, Shraman and Ramazanova, Merey and Ryan, Fiona and Shan, Wei and Somasundaram, Kiran and Song, Chenan and Southerland, Audrey and Tateno, Masatoshi and Wang, Huiyu and Wang, Yuchen and Yagi, Takuma and Yan, Mingfei and Yang, Xitong and Yu, Zecheng and Zha, Shengxin Cindy and Zhao, Chen and Zhao, Ziwei and Zhu, Zhifan and Zhuo, Jeff and Arbelaez, Pablo and Bertasius, Gedas and Damen, Dima and Engel, Jakob and Farinella, Giovanni Maria and Furnari, Antonino and Ghanem, Bernard and Hoffman, Judy and Jawahar, C.V. and Newcombe, Richard and Park, Hyun Soo and Rehg, James M. and Sato, Yoichi and Savva, Manolis and Shi, Jianbo and Shou, Mike Zheng and Wray, Michael},
    title     = {Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {19383-19400}
}