Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Arsha Nagrani, Jasper Uijlings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 38859-38869

Abstract


Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance.

Related Material


@InProceedings{Nagrani_2026_CVPR,
    author    = {Nagrani, Arsha and Uijlings, Jasper and Buch, Shyamal and Weyand, Tobias and Vijayanarasimhan, Sudheendra and Hu, Bo and Mehran, Ramin and Ross, David A and Schmid, Cordelia},
    title     = {Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {38859-38869}
}