CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets

Tanay Agrawal, Mohammed Guermal, Michal Balazia, Francois Bremond; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 7379-7388

Abstract


Challenges in cross-learning involve inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP, namely adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that allows transformer-based models to adapt to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient, as the backbone and other plugins do not need to be finetuned alongside these additions. Experiments and ablation studies on three datasets - Epic-Kitchens-100, MPIIGroupInteraction, and UDIVA v0.5 - show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input, and only 22.3% trainable parameters for two additional modalities, we achieve comparable and even better results than the state of the art. Compared to similar methods, our work achieves this result without any specific requirements for training or pretraining, and is a step towards bridging the gap between a general model and specific practical applications in the field of video classification.
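
The abstract describes plugin-style adapter blocks attached to a frozen transformer backbone, with cross-attention adapters injecting additional modalities. Below is a minimal, hypothetical PyTorch sketch in that spirit; the class name, dimensions, and bottleneck structure are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a cross-attention adapter: a small bottleneck module
# added to a frozen backbone that lets video tokens attend to another modality.
# Names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, dim: int, context_dim: int, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        # Down-project backbone tokens to a small bottleneck to keep trainable parameters low.
        self.down = nn.Linear(dim, bottleneck)
        # Cross-attention: backbone tokens (queries) attend to the extra modality (keys/values).
        self.attn = nn.MultiheadAttention(bottleneck, heads, kdim=context_dim,
                                          vdim=context_dim, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, num_tokens, dim)             tokens from the frozen backbone
        # context: (batch, num_ctx_tokens, context_dim) features from another modality
        q = self.down(self.norm(x))
        attended, _ = self.attn(q, context, context)
        # Residual connection so the adapter starts close to an identity mapping.
        return x + self.up(attended)

# Usage sketch: only the adapter is trained; the backbone stays frozen.
adapter = CrossAttentionAdapter(dim=768, context_dim=512)
video_tokens = torch.randn(2, 196, 768)
audio_feats = torch.randn(2, 50, 512)
out = adapter(video_tokens, audio_feats)   # shape: (2, 196, 768)
```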

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Agrawal_2025_WACV,
    author    = {Agrawal, Tanay and Guermal, Mohammed and Balazia, Michal and Bremond, Francois},
    title     = {CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {7379-7388}
}