HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 3455-3465

Abstract


Hand-object interaction (HOI) involves dynamics in which human manipulations produce spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus on either manipulation or effects at a coarse level and lack the fine-grained spatio-temporal reasoning needed to capture HOI dynamics. We introduce HanDyVQA, a fine-grained video QA benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totaling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best model, Gemini-2.5-Pro, achieved only 73% accuracy, well below human performance (97%). Further analysis reveals remaining challenges in spatial relationships, motion, and part-level geometric understanding. We also found that incorporating explicit HOI cues into visual features improves performance, providing insights for future HOI-aware models.

Related Material


@InProceedings{Tateno_2026_CVPR,
    author    = {Tateno, Masatoshi and Kato, Gido and Kataoka, Hirokatsu and Sato, Yoichi and Yagi, Takuma},
    title     = {HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {3455-3465}
}