@InProceedings{Li_2026_CVPR,
    author    = {Li, Shuo and Miao, Bingchen and Bu, Wendong and Li, Juncheng and Zhang, Hanwang and Wu, Fei},
    title     = {DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {7847-7858}
}
DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
Abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in extending LLMs to comprehend visual input. However, misalignment between the vision and text modalities remains a key challenge in MLLMs, and it can be attributed to two factors: misalignment of modality-specific representations and depletion of modality-specific details. To address this, we propose DeepAlign, a novel multimodal alignment framework that mitigates modality conflict by employing representation intervention and structure-induced knowledge distillation to prevent the misalignment and depletion of modality-specific information. Extensive experiments demonstrate that DeepAlign significantly mitigates modality conflicts, leading to substantial performance improvements over backbone models across multiple vision-language tasks. It also elicits some emergent abilities in MLLMs, such as multimodal in-context learning on interleaved text-image sequences.

