GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6561-6571

Abstract


We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.
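The abstract describes the task setup: a diffusion model conditioned on both an input image and a text prompt describing the target transformation. GenHowTo's own pipeline and weights are not detailed on this page, so the sketch below illustrates the same image-plus-text conditioning pattern with an analogous, publicly available model (InstructPix2Pix via Hugging Face diffusers); the model name, image paths, and prompt are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: an image-and-text conditioned diffusion pipeline
# (InstructPix2Pix), standing in for GenHowTo's conditioned diffusion model.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Initial object state, e.g. a frame mined from an instructional video
# (hypothetical file path).
initial_state = Image.open("whole_tomato.jpg").convert("RGB")

edited = pipe(
    prompt="slice the tomato",   # text prompt for the target transformation
    image=initial_state,         # conditions generation on the input frame
    num_inference_steps=50,
    image_guidance_scale=1.5,    # higher values preserve the input scene more
    guidance_scale=7.5,          # strength of the text conditioning
).images[0]
edited.save("sliced_tomato.jpg")
```

Unlike this generic editing model, the paper trains its conditioned diffusion model on automatically mined triplets of video frames (initial state, action, resulting transformation), which is what drives the environment preservation the abstract highlights.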

Related Material


@InProceedings{Soucek_2024_CVPR,
    author    = {Sou\v{c}ek, Tom\'a\v{s} and Damen, Dima and Wray, Michael and Laptev, Ivan and Sivic, Josef},
    title     = {GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {6561-6571}
}