Seeing the Unseen: Predicting the First-Person Camera Wearer's Location and Pose in Third-Person Scenes
Our goal is to predict the camera wearer's location and pose in their environment from the video captured by their wearable first-person camera. Toward this goal, we first collect a new dataset in which the camera wearer performs various activities (e.g., opening a fridge, reading a book) in different scenes, recorded with time-synchronized first-person and stationary third-person cameras. We then propose a novel deep network architecture that takes as input the first-person video frames and an image of the empty third-person scene (without the camera wearer) and predicts the location and pose of the camera wearer. We compare our approach with several intuitive baselines and show promising initial results on this novel, challenging problem.
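The abstract does not specify the network's internals; purely as an illustration of the general idea (two separate encoders for the first-person frames and the empty third-person scene image, fused to regress a 2-D location and a set of pose keypoints), here is a minimal NumPy sketch with toy linear "encoders". All layer sizes, the keypoint count (14), and the output parameterization are assumptions, not the paper's actual architecture:

```python
import numpy as np

def encoder(x, w):
    # Toy "encoder": flatten the image and apply one linear layer with ReLU.
    return np.maximum(0, x.reshape(-1) @ w)

rng = np.random.default_rng(0)

# Hypothetical inputs: one first-person frame and the empty third-person
# scene image (without the camera wearer); 32x32x3 sizes are assumptions.
fp_frame = rng.standard_normal((32, 32, 3))  # first-person view
scene = rng.standard_normal((32, 32, 3))     # empty third-person scene

d = 32 * 32 * 3
w_fp = rng.standard_normal((d, 64)) * 0.01     # first-person branch weights
w_scene = rng.standard_normal((d, 64)) * 0.01  # scene branch weights

# Fuse the two branch embeddings by concatenation, then regress the
# outputs with two linear heads.
fused = np.concatenate([encoder(fp_frame, w_fp), encoder(scene, w_scene)])

w_loc = rng.standard_normal((128, 2)) * 0.01        # (x, y) location head
w_pose = rng.standard_normal((128, 14 * 2)) * 0.01  # 14 assumed keypoints

location = fused @ w_loc                     # wearer's (x, y) in the scene
pose = (fused @ w_pose).reshape(14, 2)       # per-keypoint (x, y)
print(location.shape, pose.shape)
```

With trained weights, `location` would give the wearer's position in the third-person scene image and `pose` the keypoint coordinates of their body; here the random weights only demonstrate the data flow and output shapes.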