VirtualHome Action Genome: A Simulated Spatio-Temporal Scene Graph Dataset With Consistent Relationship Labels
Spatio-temporal scene graph generation is an essential task in household activity recognition that aims to identify human-object interactions. Constructing a dataset with per-frame object regions and consistent relationship annotations is extremely labor-intensive. Existing datasets sparsely annotate frames sampled from videos and therefore lack the dense spatio-temporal correlations present in videos. Additionally, existing datasets contain inconsistent relationship annotations, which causes models to learn ambiguous temporal associations. Moreover, existing datasets mainly cover relationships that can be inferred from a single frame, ignoring the significance of temporal associations. To resolve these issues, we created a simulated dataset with per-frame consistent annotations and introduced a range of relationships that require both spatial and temporal context. Most existing methods explore spatial correlations within single images and do not explicitly consider the dynamic changes across frames. Therefore, we proposed a tracking-based approach that explicitly captures spatio-temporal human-object interactions while simultaneously localizing humans and objects. Our proposed approach achieved state-of-the-art performance on scene graph generation and outperformed existing methods in scene graph localization by large margins on the proposed dataset. Moreover, the experiments show that pre-training on the proposed dataset is effective when adapting to an existing benchmark of real daily videos, indicating the potential of the proposed dataset in real-world scenarios.