MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Zhang, Jianyi; Yang, Hao; Li, Ang; Guo, Xin; Wang, Pu; Wang, Haiming; Chen, Yiran; Li, Hai

Jianyi Zhang, Hao Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 4066-4076

Abstract

Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. In light of the recent advances in multimodal large language models (MLLMs) such as GPT-4V and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the challenges of heterogeneous and long-tailed data. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive yet previously underexploited open-source data accessible from websites, as well as powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, we conduct global visual-text pretraining of the model, using the extensive open-source data available online with the assistance of MLLMs. Next, the pretrained model is distributed among the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
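
The control flow of the three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the model is reduced to a flat list of floats, the names `weighted_average` and `run_round` and the `local_train` and `global_align` callbacks are hypothetical, and the FedAvg-style size-weighted averaging is an assumption, since the abstract does not name the aggregation rule.

```python
from typing import Callable, List, Sequence

Weights = List[float]  # toy stand-in for model parameters


def weighted_average(models: Sequence[Weights], sizes: Sequence[int]) -> Weights:
    """Average client models, weighted by local dataset size.

    Assumption: FedAvg-style aggregation; the abstract does not specify one.
    """
    total = sum(sizes)
    dim = len(models[0])
    return [sum(m[i] * n for m, n in zip(models, sizes)) / total for i in range(dim)]


def run_round(
    pretrained: Weights,
    client_sizes: Sequence[int],
    local_train: Callable[[Weights, int], Weights],
    global_align: Callable[[Weights], Weights],
) -> Weights:
    """One communication round, starting from the stage-1 pretrained model.

    `pretrained` is the output of stage 1 (server-side pretraining).
    `local_train` (stage 2) and `global_align` (stage 3) are hypothetical
    placeholders for client training and MLLM-supervised server alignment.
    """
    # Stage 2: each client receives a copy of the model and trains locally.
    local_models = [
        local_train(list(pretrained), client_id)
        for client_id in range(len(client_sizes))
    ]
    # Stage 3: the server aggregates the returned models, then aligns the result.
    aggregated = weighted_average(local_models, client_sizes)
    return global_align(aggregated)
```

In this sketch only model parameters travel between clients and the server, which reflects the abstract's point that the MLLM-dependent steps stay on the server and add no computation to local devices.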

Related Material

[bibtex]

@InProceedings{Zhang_2025_WACV,
    author    = {Zhang, Jianyi and Yang, Hao and Li, Ang and Guo, Xin and Wang, Pu and Wang, Haiming and Chen, Yiran and Li, Hai},
    title     = {MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {4066-4076}
}