Imagine This! Scripts to Compositions to Videos

Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, Aniruddha Kembhavi; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 598-613


Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. As a step towards this goal, we present the Composition, Retrieval and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it to generate videos from novel captions. CRAFT explicitly predicts a temporal layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database, and fuses them to generate scene videos. Our modeling contributions include sequential training of the components of CRAFT while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate CRAFT on semantic fidelity to the caption, composition consistency, and visual quality. CRAFT outperforms direct pixel-generation approaches and generalizes well to unseen captions as well as to unseen video databases with no text annotations. We demonstrate CRAFT on FLINTSTONES, a new richly annotated video-caption dataset with over 25,000 videos.

Related Material

[pdf] [arXiv]
@InProceedings{Gupta_2018_ECCV,
author = {Gupta, Tanmay and Schwenk, Dustin and Farhadi, Ali and Hoiem, Derek and Kembhavi, Aniruddha},
title = {Imagine This! Scripts to Compositions to Videos},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}