Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 26219-26229

Abstract

Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.

Related Material

@InProceedings{You_2026_CVPR,
    author    = {You, Xiaoxing and Huang, Qiang and Li, Lingyu and Chang, Xiaojun and Yu, Jun},
    title     = {Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {26219-26229}
}