Synthesizing Compositional Videos from Text Description

Prajwal Singh, Kuldeep Kulkarni, Shanmuganathan Raman, Harsh Rangwani; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 6775-6784

Abstract

Existing pre-trained text-to-video diffusion models can generate high-quality videos but often suffer from misalignment between the generated content and the input text, particularly when composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR, which performs test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, enabling the generation of multiple objects in a scene while modeling their actions and interactions. We further extend our approach to the Multi-Action video generation setting, where only the specified action varies across a sequence of prompts; to ensure coherent action transitions, we introduce a novel token-swapping and latent interpolation strategy. Extensive experiments and ablation studies show that our method significantly outperforms baseline methods, generating videos with improved semantic and compositional consistency as well as better temporal coherence.
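The abstract only names these components, so the sketches below are illustrative only and not the authors' implementation. The first assumes per-subject-token cross-attention maps of shape (H, W) extracted from the denoising network; the helper names attention_centroid and centroid_loss and the margin parameter are hypothetical. Aggregation pulls each token's attention mass toward its own centroid, while segregation pushes the centroids of different tokens apart:

import torch

def attention_centroid(attn_map):
    # attn_map: (H, W) cross-attention weights for one subject token.
    h, w = attn_map.shape
    p = attn_map / (attn_map.sum() + 1e-8)          # normalize to a distribution
    ys = torch.arange(h, dtype=p.dtype, device=p.device)
    xs = torch.arange(w, dtype=p.dtype, device=p.device)
    cy = (p.sum(dim=1) * ys).sum()                  # expected row index
    cx = (p.sum(dim=0) * xs).sum()                  # expected column index
    return torch.stack([cy, cx])

def centroid_loss(attn_maps, margin=8.0):
    # attn_maps: list of (H, W) maps, one per subject token.
    cents = [attention_centroid(a) for a in attn_maps]
    agg, seg = 0.0, 0.0
    # Aggregation: penalize spatial variance of each map around its centroid.
    for a, c in zip(attn_maps, cents):
        h, w = a.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=a.device, dtype=a.dtype),
            torch.arange(w, device=a.device, dtype=a.dtype),
            indexing="ij")
        d2 = (ys - c[0]) ** 2 + (xs - c[1]) ** 2
        p = a / (a.sum() + 1e-8)
        agg = agg + (p * d2).sum()
    # Segregation: hinge loss keeps centroids of different tokens apart.
    for i in range(len(cents)):
        for j in range(i + 1, len(cents)):
            seg = seg + torch.relu(margin - torch.norm(cents[i] - cents[j]))
    return agg + seg

In a test-time optimization loop, such a loss would typically be backpropagated to the noisy latent for a few gradient steps at selected denoising steps, steering attention without any fine-tuning. For the Multi-Action setting, one plausible (again hypothetical, not the paper's stated procedure) realization of token swapping and latent interpolation is to keep the shared prompt embedding fixed, overwrite only the action-token positions, and spherically interpolate the noise latents of consecutive clips; slerp, swap_action_tokens, and action_slice are assumed names:

import torch

def slerp(z0, z1, t, eps=1e-7):
    # Spherical interpolation between two latent tensors, treated as flat vectors.
    a, b = z0.flatten(), z1.flatten()
    cos = torch.clamp((a @ b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos)
    if omega.abs() < eps:                           # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    s = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / s) * z0 + (torch.sin(t * omega) / s) * z1

def swap_action_tokens(base_embeds, action_embeds, action_slice):
    # base_embeds: (B, L, D) text embeddings shared across clips.
    # Overwrite only the action-token positions so everything except
    # the specified action stays identical across the prompt sequence.
    out = base_embeds.clone()
    out[:, action_slice] = action_embeds[:, action_slice]
    return out

Interpolating only the latents while swapping only the action tokens would keep the scene content fixed across clips so that just the specified action varies between them.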

Related Material

[bibtex]
@InProceedings{Singh_2026_WACV,
  author    = {Singh, Prajwal and Kulkarni, Kuldeep and Raman, Shanmuganathan and Rangwani, Harsh},
  title     = {Synthesizing Compositional Videos from Text Description},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {March},
  year      = {2026},
  pages     = {6775-6784}
}