ViHOI: Human-Object Interaction Synthesis with Visual Priors

Cai, Songjin; Zhong, Linjie; Guo, Ling; Ding, Changxing

Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 30686-30695

Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization. The code for this work will be released at https://github.com/MPI-Lab/ViHOI.
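
As a rough illustration of the compression idea mentioned in the abstract, the sketch below shows how a small, fixed set of query vectors can cross-attend over a longer sequence of feature tokens to produce a fixed number of compact tokens. It is not the authors' adapter: the function name `compress_to_prior_tokens`, the sizes, and the random inputs are assumptions made for illustration, and the sketch omits the learned projections, multiple heads, and feed-forward layers that a real Q-Former-style module would include.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def compress_to_prior_tokens(queries, features):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    queries:  K x d list of lists (learnable in a real model; fixed here)
    features: N x d list of lists (stand-in for VLM hidden states)
    returns:  K x d list of lists, one compact token per query
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every feature token.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Each output token is an attention-weighted average of the features.
        tokens.append([sum(w * f[j] for w, f in zip(weights, features))
                       for j in range(d)])
    return tokens

rng = random.Random(0)
d, num_features, num_queries = 8, 50, 4  # illustrative sizes, not from the paper
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_features)]
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]

prior_tokens = compress_to_prior_tokens(queries, features)
print(len(prior_tokens), len(prior_tokens[0]))  # prints "4 8"
```

The number of output tokens depends only on the number of queries, not on the length of the input feature sequence, which is what makes this kind of module usable as a fixed-size conditioning signal.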

Related Material

[bibtex]

@InProceedings{Cai_2026_CVPR,
    author    = {Cai, Songjin and Zhong, Linjie and Guo, Ling and Ding, Changxing},
    title     = {ViHOI: Human-Object Interaction Synthesis with Visual Priors},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {30686-30695}
}