Understanding Multi-Task Activities from Single-Task Videos

Yuhan Shen, Ehsan Elhamifar; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 19120-19131

Abstract


We introduce Multi-Task Temporal Action Segmentation (MT-TAS), a novel paradigm that addresses the challenge of interleaved actions that arises when multiple tasks are performed simultaneously. Traditional action segmentation models, trained on single-task videos, struggle to handle the task switches and complex scenes inherent in multi-task scenarios. To overcome these challenges, our MT-TAS approach synthesizes multi-task video data from single-task sources using our Multi-task Sequence Blending and Segment Boundary Learning modules. Additionally, we propose to dynamically isolate foreground and background elements within video frames, addressing the intricacies of object layouts in multi-task scenarios and enabling a new two-stage temporal action segmentation framework with Foreground-Aware Action Refinement. Furthermore, we introduce the Multi-task Egocentric Kitchen Activities (MEKA) dataset, containing 12 hours of egocentric multi-task videos, to rigorously benchmark MT-TAS models. Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments.
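To make the idea of synthesizing multi-task training data from single-task videos concrete, the sketch below interleaves the action segments of two single-task videos into one synthetic multi-task sequence. It is a minimal illustration under assumed names and data structures (e.g., blend_single_task_sequences and frame-level segment tuples), not the paper's actual Multi-task Sequence Blending or Segment Boundary Learning modules.

    # Illustrative sketch: interleave action segments from two single-task videos to
    # emulate the task switches found in multi-task activities. The function name,
    # data layout, and random switching policy are assumptions for illustration.
    import random
    from typing import List, Tuple

    Segment = Tuple[str, int, int]  # (action label, start frame, end frame)

    def blend_single_task_sequences(task_a: List[Segment],
                                    task_b: List[Segment],
                                    switch_prob: float = 0.5,
                                    seed: int = 0) -> List[Segment]:
        """Merge two single-task segment sequences into one synthetic multi-task
        sequence, preserving the within-task order of each and re-timing segments
        onto a single contiguous timeline."""
        rng = random.Random(seed)
        i = j = 0          # next unused segment in each source video
        t = 0              # current frame in the synthetic video
        blended: List[Segment] = []
        while i < len(task_a) or j < len(task_b):
            # Pick which task to continue; fall back to the other when one runs out.
            take_a = j >= len(task_b) or (i < len(task_a) and rng.random() < switch_prob)
            label, start, end = task_a[i] if take_a else task_b[j]
            duration = end - start
            blended.append((label, t, t + duration))
            t += duration
            if take_a:
                i += 1
            else:
                j += 1
        return blended

    if __name__ == "__main__":
        make_tea = [("boil water", 0, 100), ("steep tea", 100, 250)]
        make_toast = [("slice bread", 0, 60), ("toast bread", 60, 180)]
        for segment in blend_single_task_sequences(make_tea, make_toast):
            print(segment)

A real pipeline would blend video features and learn segment boundaries rather than just label sequences; this toy version only illustrates the interleaving that produces task switches in multi-task footage.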

Related Material


@InProceedings{Shen_2025_CVPR,
    author    = {Shen, Yuhan and Elhamifar, Ehsan},
    title     = {Understanding Multi-Task Activities from Single-Task Videos},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19120-19131}
}