Multi-Modal Few-Shot Temporal Action Segmentation

Abstract

Procedural videos are critical for learning new tasks. Temporal action segmentation (TAS), which classifies the action in every video frame, has become essential for understanding procedural videos. Existing TAS models, however, learn a fixed set of tasks at training and are unable to adapt to novel tasks at test time. Thus, we introduce the new problem of Multi-Modal Few-Shot Temporal Action Segmentation (MMF-TAS) to learn open-set models that can generalize to novel procedural tasks with minimal visual/textual examples. We propose the first MMF-TAS framework by designing a Prototype Graph Network (PGNet). In PGNet, a Prototype Building Block summarizes action information from support videos of the novel tasks via an Action Relation Graph and encodes this information into action prototypes via a Dynamic Graph Transformer. Next, a Matching Block compares the action prototypes with query videos to infer framewise action labels. To exploit the advantages of both the visual and textual modalities, we compute separate action prototypes for each modality and combine the two through prediction fusion to avoid overfitting to one modality. Through extensive experiments on procedural datasets, we show that our method successfully adapts to novel tasks during inference and significantly outperforms baselines. Our code is available at https://github.com/ZijiaLewisLu/ICCV2025-MMF-TAS.
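The abstract outlines a prototype-then-match pipeline: build per-modality action prototypes from the support videos, compare them with the frames of a query video, and fuse the two modalities at the prediction level. The sketch below illustrates that flow only in broad strokes; it substitutes simple mean pooling for the paper's Action Relation Graph and Dynamic Graph Transformer, and the function names, the cosine-similarity matching, and the fixed fusion weight alpha are assumptions rather than the released implementation.

# Illustrative sketch of the few-shot segmentation flow described in the abstract.
# Mean pooling stands in for the Action Relation Graph / Dynamic Graph Transformer;
# matching and fusion choices are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def build_prototypes(support_feats, support_labels, num_actions):
    # support_feats: (T, D) frame features of the support videos (one modality)
    # support_labels: (T,) framewise action ids in [0, num_actions)
    protos = torch.zeros(num_actions, support_feats.size(1))
    for a in range(num_actions):
        mask = support_labels == a
        if mask.any():
            protos[a] = support_feats[mask].mean(dim=0)  # mean-pooled action prototype
    return protos

def match(query_feats, protos, temperature=0.1):
    # cosine similarity between every query frame and every action prototype,
    # turned into framewise class probabilities
    q = F.normalize(query_feats, dim=-1)   # (T_q, D)
    p = F.normalize(protos, dim=-1)        # (A, D)
    logits = q @ p.t() / temperature       # (T_q, A)
    return logits.softmax(dim=-1)

def segment_query(query_vis, query_txt, protos_vis, protos_txt, alpha=0.5):
    # prediction-level fusion of the visual and textual branches, so that the
    # final framewise labels do not overfit to a single modality
    probs = alpha * match(query_vis, protos_vis) + (1 - alpha) * match(query_txt, protos_txt)
    return probs.argmax(dim=-1)            # (T_q,) predicted action per frame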
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Lu_2025_ICCV,
  author    = {Lu, Zijia and Elhamifar, Ehsan},
  title     = {Multi-Modal Few-Shot Temporal Action Segmentation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {14106-14116}
}