Precise Action-to-Video Generation Through Visual Action Prompts
Abstract
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods based on text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts, domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We develop robust pipelines to construct skeletons from two interaction-rich data sources -- human-object interactions (HOI) and dexterous robotic manipulation -- enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of our approach.
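The abstract does not spell out the conditioning interface; as a rough illustration of the idea, the sketch below "renders" a per-frame 2D skeleton (e.g., hand or robot-arm joints) into an image-space prompt video that a pretrained generator could ingest through a lightweight adapter. The keypoint trajectory, the EDGES connectivity, and render_skeleton_prompt are hypothetical stand-ins for illustration, not the paper's implementation.

# Minimal sketch of "rendering" actions as visual skeleton prompts.
# Assumptions (not from the paper): per-frame 2D keypoints are already
# available from a pose estimator, and the conditioning signal is a
# per-frame RGB raster stacked into a prompt video.
import numpy as np
import cv2

# Hypothetical bone connectivity for a simple articulated chain
# (index pairs into the keypoint array).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]

def render_skeleton_prompt(keypoints_2d, height, width):
    """Rasterize one frame's 2D skeleton into an image-space action prompt.

    keypoints_2d: (K, 2) array of pixel coordinates.
    Returns an (H, W, 3) uint8 image: bones as lines, joints as discs.
    """
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for i, j in EDGES:
        p = tuple(map(int, keypoints_2d[i]))
        q = tuple(map(int, keypoints_2d[j]))
        cv2.line(canvas, p, q, color=(0, 255, 0), thickness=2)
    for kp in keypoints_2d:
        cv2.circle(canvas, tuple(map(int, kp)), radius=3,
                   color=(0, 0, 255), thickness=-1)
    return canvas

# Usage: stack per-frame renders into a (T, H, W, 3) prompt video that a
# pretrained video model could consume via lightweight fine-tuning.
T, H, W = 16, 256, 256
trajectory = np.random.rand(T, 5, 2) * np.array([W, H])  # fake keypoint track
prompt_video = np.stack([render_skeleton_prompt(kp, H, W) for kp in trajectory])
print(prompt_video.shape)  # (16, 256, 256, 3)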
Related Material

[pdf] [arXiv]

BibTeX:

@InProceedings{Wang_2025_ICCV,
    author    = {Wang, Yuang and Wen, Chao and Guo, Haoyu and Peng, Sida and Qin, Minghan and Bao, Hujun and Zhou, Xiaowei and Hu, Ruizhen},
    title     = {Precise Action-to-Video Generation Through Visual Action Prompts},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {12713-12724}
}