Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal (University of Bristol), Bikram Boote (University of Illinois Urbana-Champaign), Eugene H. Byrne, Zach Chavis (University of Minnesota System), Joya Chen (National University of Singapore), Feng Cheng, Fu-Jen Chu, Sean Crane (Carnegie Mellon University), Avijit Dasgupta (International Institute of Information Technology, Hyderabad), Jing Dong, María Escobar (Universidad de Los Andes), Cristhian Forigua (Universidad de Los Andes), Abrham Gebreselasie (Carnegie Mellon University), Sanjay Haresh (Simon Fraser University), Jing Huang, Md. Mohaiminul Islam (University of North Carolina at Chapel Hill), Suyog Dutt Jain, Rawal Khirodkar (Carnegie Mellon University), Devansh Kukreja, Kevin J Liang, Jia-Wei Liu (National University of Singapore), Sagnik Majumder, Yongsen Mao (Simon Fraser University), Miguel Martín, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa (University of Catania), Santhosh Kumar Ramakrishnan (The University of Texas at Austin), Luigi Seminara (University of Catania), Arjun Somayazulu (The University of Texas at Austin), Yale Song, Shan Su (California University of Pennsylvania), Zihui Xue, Edward Zhang (California University of Pennsylvania), Jinxu Zhang (California University of Pennsylvania), Angela Castillo (Universidad de Los Andes), Changan Chen (The University of Texas at Austin), Xinzhu Fu (National University of Singapore), Ryosuke Furuta (The University of Tokyo), Cristina González (Universidad de Los Andes), Prince Gupta, Jiabo Hu, Yifei Huang (California University of Pennsylvania), Yiming Huang (California University of Pennsylvania), Weslie Khoo (Indiana University), Anush Kumar (University of Minnesota System), Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo (The University of Texas at Austin), Zhengyi Luo (Carnegie Mellon University), B. D. Meredith, Austin Miller, Oluwatumininu Oguntola (University of North Carolina at Chapel Hill), Xiaqing Pan, Penny Peng, Shraman Pramanick (Johns Hopkins University), Merey Ramazanova (King Abdullah University of Science and Technology), Fiona Ryan (Georgia Institute of Technology), W. Shan (University of North Carolina at Chapel Hill), Kiran Somasundaram, Chenan Song (National University of Singapore), Audrey Southerland (Georgia Institute of Technology), Masatoshi Tateno (The University of Tokyo), Huiyu Wang, Yuchen Wang (Indiana University), Takuma Yagi (The University of Tokyo), Mingfei Yan, Xitong Yang, Zecheng Yu (The University of Tokyo), Shengxin Zha, Chen Zhao (King Abdullah University of Science and Technology), Ziwei Zhao (Indiana University), Zhifan Zhu (University of Bristol), Jeff Zhuo (University of North Carolina at Chapel Hill), Pablo Arbeláez (Universidad de Los Andes), Gedas Bertasius (University of North Carolina at Chapel Hill), Dima Damen (University of Bristol), Jakob Engel, Giovanni Maria Farinella (University of Catania), Antonino Furnari (University of Catania), Bernard Ghanem (King Abdullah University of Science and Technology), Judy Hoffman (Georgia Institute of Technology), C. V. Jawahar (International Institute of Information Technology, Hyderabad), Richard Newcombe, Hyun Soo Park (University of Minnesota System), James M. Rehg (University of Illinois Urbana-Champaign), Yoichi Sato (The University of Tokyo), Manolis Savva (Simon Fraser University), Jianbo Shi (California University of Pennsylvania), Mike Zheng Shou (National University of Singapore), Michael Wray (University of Bristol)
June 16, 2024
Cited by 84

Abstract

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.

