PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos

Jiang, Xun; Huang, Zhiyi; Xu, Xing; Song, Jingkuan; Shen, Fumin; Shen, Heng Tao

Xun Jiang, Zhiyi Huang, Xing Xu, Jingkuan Song, Fumin Shen, Heng Tao Shen; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 8615-8624

Abstract

Natural Language-based Egocentric Task Verification (NLETV) aims to equip agents to determine whether the operation flows of procedural tasks in egocentric videos align with natural language instructions. Describing rules in natural language makes the approach broadly applicable, but it also raises challenges of cross-modal heterogeneity and hierarchical misalignment. In this paper, we propose a novel approach termed Procedural Heterogeneous Graph Completion (PHGC), which addresses these challenges with heterogeneous graphs that represent the logic in rules and operation flows. Specifically, our PHGC method consists of three key components: (1) a Heterogeneous Graph Construction module that defines objective states and operation flows as vertices, with temporal and sequential relations as edges; (2) a Cross-Modal Path Finding module that aligns semantic relations between hierarchical video and text elements; and (3) a Discriminative Entity Representation module that excavates hidden entities integrating general logical relations and discriminative cues to reveal the final verification results. Additionally, we construct a new dataset called CSV-NL, comprised of realistic videos. Extensive experiments on two benchmark datasets covering both digital and physical scenarios, i.e., EgoTV and CSV-NL, demonstrate that our proposed PHGC establishes state-of-the-art performance across different settings. Our code and dataset are available at https://github.com/XunCHN/PHGC.
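To make the graph vocabulary in the abstract concrete, here is a minimal sketch of a typed (heterogeneous) graph whose vertices are states or operations and whose edges are labeled temporal or sequential. It is an illustration only, not the authors' implementation: the class `HeteroGraph`, the helper `build_example`, the vertex names, and the choice of which vertex types each relation connects are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical type labels, chosen for illustration only.
VERTEX_TYPES = {"state", "operation"}
EDGE_TYPES = {"temporal", "sequential"}


@dataclass
class HeteroGraph:
    """Toy typed graph: vertices carry a type, edges carry a relation label."""
    vertices: dict = field(default_factory=dict)  # vertex name -> vertex type
    edges: set = field(default_factory=set)       # (src, relation, dst) triples

    def add_vertex(self, name, vtype):
        if vtype not in VERTEX_TYPES:
            raise ValueError(f"unknown vertex type: {vtype}")
        self.vertices[name] = vtype

    def add_edge(self, src, relation, dst):
        if relation not in EDGE_TYPES:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.add((src, relation, dst))

    def neighbors(self, src, relation):
        """Return, in sorted order, the vertices reached from src via relation."""
        return sorted(d for s, r, d in self.edges if s == src and r == relation)


def build_example():
    """Build a toy graph for an instruction like 'slice the apple, then heat it'."""
    g = HeteroGraph()
    g.add_vertex("slice", "operation")
    g.add_vertex("heat", "operation")
    g.add_vertex("apple_sliced", "state")
    g.add_vertex("apple_hot", "state")
    # Assumed convention: operations are linked sequentially, states temporally.
    g.add_edge("slice", "sequential", "heat")
    g.add_edge("apple_sliced", "temporal", "apple_hot")
    return g
```

Calling `build_example().neighbors("slice", "sequential")` returns `["heat"]`. The sketch covers only the data structure; the path-finding and entity-representation components named in the abstract are learned modules and are not modeled here.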

Related Material

[bibtex]

@InProceedings{Jiang_2025_CVPR,
    author    = {Jiang, Xun and Huang, Zhiyi and Xu, Xing and Song, Jingkuan and Shen, Fumin and Shen, Heng Tao},
    title     = {PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {8615-8624}
}