X-Fusion: Introducing New Modality to Frozen Large Language Models

Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 228-238

Abstract


We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

Related Material


@InProceedings{Mo_2025_ICCV,
    author    = {Mo, Sicheng and Nguyen, Thao and Huang, Xun and Iyer, Siddharth Srinivasan and Li, Yijun and Liu, Yuchen and Tandon, Abhishek and Shechtman, Eli and Singh, Krishna Kumar and Lee, Yong Jae and Zhou, Bolei and Li, Yuheng},
    title     = {X-Fusion: Introducing New Modality to Frozen Large Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {228-238}
}