Forecasting 3D Scanpaths in Egocentric Video

Fiona Ryan, Ishwarya Ananthabhotla, Yijun Qian, Judy Hoffman, James M. Rehg, Vamsi Krishna Ithapu, Calvin Murdock; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 42824-42835

Abstract


Forecasting gaze behavior is an important task for understanding user intent and creating AR/VR systems that can anticipate where users will look and interact next. While prior work has addressed scanpath prediction in static images, forecasting gaze in egocentric video presents new challenges due to the dynamic nature of the scene and the camera wearer's continuous movement through the 3D environment. To address these challenges, we formulate the novel task of egocentric scanpath prediction as forecasting a sequence of future fixations in 3D Cartesian coordinates relative to the last observed camera pose, producing a 3D scanpath that is grounded in the environment. We propose a transformer architecture that leverages egocentric video frames, head pose, and past 3D gaze observations to predict future 3D fixation sequences. We evaluate our method on the Aria Digital Twin dataset. Our findings establish a baseline for 3D scanpath prediction and highlight important architectural elements for this task.
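
The abstract defines the prediction target as fixations in 3D Cartesian coordinates relative to the last observed camera pose. The sketch below illustrates one way such a target could be computed from world-frame fixation points. It is not the authors' implementation: the pose convention (a rotation matrix mapping camera-frame vectors to the world frame, plus the camera position in world coordinates), the function names, and the example values are all assumptions made for illustration.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def world_to_camera(point_world: Vec3,
                    rot_world_from_cam: Mat3,
                    cam_pos_world: Vec3) -> List[float]:
    """Express a world-frame point in the camera frame: p_cam = R^T (p_world - t)."""
    # Offset of the point from the camera position, still in world axes.
    d = [point_world[i] - cam_pos_world[i] for i in range(3)]
    # Multiply by R^T: component j is the dot product of column j of R with d.
    return [sum(rot_world_from_cam[i][j] * d[i] for i in range(3))
            for j in range(3)]


def scanpath_in_last_camera_frame(fixations_world: Sequence[Vec3],
                                  rot_world_from_cam: Mat3,
                                  cam_pos_world: Vec3) -> List[List[float]]:
    """Re-express every future fixation relative to the last observed camera pose."""
    return [world_to_camera(p, rot_world_from_cam, cam_pos_world)
            for p in fixations_world]


# Hypothetical last observed pose: camera at (1, 0, 0), rotated 90 degrees
# about the world z-axis (assumed convention: columns of R are the camera
# axes written in world coordinates).
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]

# Hypothetical future fixation points in world coordinates.
future_fixations_world = [[1.0, 2.0, 0.0], [2.0, 0.0, 1.0]]

scanpath = scanpath_in_last_camera_frame(future_fixations_world, R, t)
print(scanpath)
```

Anchoring every future fixation to a single reference pose keeps the whole forecast sequence in one fixed coordinate frame, even though the camera wearer keeps moving after the last observation.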

Related Material


[bibtex]
@InProceedings{Ryan_2026_CVPR,
    author    = {Ryan, Fiona and Ananthabhotla, Ishwarya and Qian, Yijun and Hoffman, Judy and Rehg, James M. and Ithapu, Vamsi Krishna and Murdock, Calvin},
    title     = {Forecasting 3D Scanpaths in Egocentric Video},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {42824-42835}
}