Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions

Wu, Chi Hsuan; Ashutosh, Kumar; Grauman, Kristen

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 23988-23999

Abstract

When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual content. However, prior work does not support visual illustration of multistep descriptions, e.g., a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions while remaining visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.

Related Material

[bibtex]

@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Chi Hsuan and Ashutosh, Kumar and Grauman, Kristen},
    title     = {Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {23988-23999}
}