Compositional Transformation Reasoning for Composed Video Retrieval

Huang, Sihong; Wu, Jiaxin; Jiang, Dongmei; Cai, Yi; Wang, Yaowei; Wei, Xiaoyong

Sihong Huang, Jiaxin Wu, Dongmei Jiang, Yi Cai, Yaowei Wang, Xiaoyong Wei; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 25644-25653

Abstract

Composed Video Retrieval aims to retrieve a target video given a reference video and a textual modification describing the desired change. The core challenge lies in modeling compositional multimodal transformations, i.e., how entities, actions, and scenes evolve across video and language modalities in response to fine-grained textual edits. Existing methods address this issue by training on large-scale video-text-video triplets or by generating dense textual descriptions to capture subtle visual differences. However, these supervised approaches often rely on noisy web-scale data and dataset-specific correspondences, leading to overfitting and limited generalization in diverse or fine-grained scenarios, while also failing to effectively model compositional and temporal transformations. We propose Multi-objective Reasoning (MoRe), a zero-shot framework based on MLLMs for multi-objective candidate selection and fine-grained transformation reasoning. Our method decomposes the compositional transformation into three complementary reasoning dimensions, i.e., entity, action, and scene, and performs pairwise candidate reasoning to explicitly capture semantic evolution over time. Furthermore, we introduce a recall-oriented multi-objective candidate selection module that identifies high-quality retrieval targets by jointly balancing visual, textual, and multimodal similarities before transformation reasoning. Experiments on EgoCVR and WebVid-CoVR demonstrate the effectiveness of our method over state-of-the-art approaches under the zero-shot setting, with R@1 improvements of +5.8 and +10.8, respectively.
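The abstract does not say how the recall-oriented candidate selection combines its three similarity signals, so the following is only a minimal sketch under one assumption: the candidate pool is the union of the top-k videos under each objective, so a target that ranks well on any single signal survives to the reasoning stage. The function names, the `objectives` dictionary, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, best first (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def select_candidates(objectives: Dict[str, Sequence[float]], k: int) -> List[int]:
    """Assumed selection rule: union of per-objective top-k lists, in first-seen order."""
    selected: List[int] = []
    seen = set()
    for scores in objectives.values():
        for idx in top_k(scores, k):
            if idx not in seen:
                seen.add(idx)
                selected.append(idx)
    return selected


# Hypothetical similarity scores between one query and five gallery videos.
objectives = {
    "visual": [0.9, 0.2, 0.4, 0.1, 0.3],
    "textual": [0.1, 0.8, 0.3, 0.2, 0.7],
    "multimodal": [0.5, 0.6, 0.9, 0.1, 0.2],
}
candidates = select_candidates(objectives, k=2)
print(candidates)
```

In this toy example, video 3 scores poorly on every objective and is dropped, while a video that is strong on only one signal (such as video 4 on textual similarity) is retained. A union favors recall over precision, which matches the stated goal of passing high-quality targets to a later reasoning step, but the paper's actual balancing scheme may differ.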

Related Material

@InProceedings{Huang_2026_CVPR,
    author    = {Huang, Sihong and Wu, Jiaxin and Jiang, Dongmei and Cai, Yi and Wang, Yaowei and Wei, Xiaoyong},
    title     = {Compositional Transformation Reasoning for Composed Video Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {25644-25653}
}