Visual Relation Diffusion for Human-Object Interaction Detection

Ping Cao, Yepeng Tang, Chunjie Zhang, Xiaolong Zheng, Chao Liang, Yunchao Wei, Yao Zhao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 23551-23560

Abstract


Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.
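The two ideas in the abstract — conditioning the diffusion process on a visual relation feature instead of text, and using the diffusion model's feedback to refine that feature without synthesizing images — can be illustrated with a toy NumPy sketch. Everything here (the dimensions, the linear stand-in for a conditional denoiser, the gradient step) is an illustrative assumption, not the paper's actual VRDiff architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_FEAT = 8   # visual relation condition dimension
D_X = 8      # dimension of the feature being diffused

# Standard DDPM-style forward noising schedule:
# x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# A linear "denoiser" conditioned on a visual relation feature c,
# a stand-in for a conditional diffusion network: eps_hat = W @ [x_t; c].
W = rng.normal(scale=0.1, size=(D_X, D_X + D_FEAT))

def predict_eps(x_t, c):
    return W @ np.concatenate([x_t, c])

def denoising_loss(x0, c, t, eps):
    """Epsilon-prediction loss; its gradient w.r.t. c acts as
    generative feedback on the relation condition."""
    x_t = q_sample(x0, t, eps)
    eps_hat = predict_eps(x_t, c)
    return float(np.mean((eps_hat - eps) ** 2))

x0 = rng.normal(size=D_X)     # clean pooled feature of a human-object pair
c = rng.normal(size=D_FEAT)   # interaction-aware visual relation condition
eps = rng.normal(size=D_X)
t = 50

loss = denoising_loss(x0, c, t, eps)

# Feedback step: move c down the loss gradient (finite differences for
# brevity), refining the relation representation; no image is synthesized.
h = 1e-5
grad = np.zeros_like(c)
for i in range(D_FEAT):
    c_p = c.copy()
    c_p[i] += h
    grad[i] = (denoising_loss(x0, c_p, t, eps) - loss) / h
c_refined = c - 0.5 * grad

refined_loss = denoising_loss(x0, c_refined, t, eps)
print(refined_loss < loss)
```

The point of the sketch is only the data flow: the condition entering the denoiser is a visual feature `c` rather than a text embedding, and the denoising loss is differentiated with respect to `c` to improve the relation representation itself.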

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Cao_2025_ICCV,
    author    = {Cao, Ping and Tang, Yepeng and Zhang, Chunjie and Zheng, Xiaolong and Liang, Chao and Wei, Yunchao and Zhao, Yao},
    title     = {Visual Relation Diffusion for Human-Object Interaction Detection},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {23551-23560}
}