Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation

Yuheng Feng, Changsong Wen, Zelin Peng, Li jiaye, Siyu Zhu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 24895-24904

Abstract


Contrastive language-image pretraining models such as CLIP have demonstrated remarkable performance across a wide range of text-image alignment tasks. However, CLIP's inherent 77-token input limit and its reliance on predominantly short-text training data restrict its ability to handle long-text tasks effectively. To overcome these constraints, we propose LongD-CLIP, a dual-teacher distillation framework designed to enhance long-text representation while mitigating knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into a student model, while the original CLIP serves as a secondary teacher that helps the student retain its foundational knowledge. Extensive experiments show that LongD-CLIP significantly outperforms existing models across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP exceeds Long-CLIP by 2.5%, achieving an accuracy of 98.3%. Similarly, on the Urban-1k dataset, it records a 9.2% improvement, reaching 91.9%, underscoring its strong generalization. Additionally, the text encoder of LongD-CLIP exhibits reduced latent-space drift and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
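To make the dual-teacher objective concrete, the following is a minimal PyTorch sketch of one way a student text encoder could be distilled toward both teachers at once. The function name, the cosine-distance formulation, and the weighting parameter `alpha` are illustrative assumptions, not the paper's published objective.

```python
import torch
import torch.nn.functional as F

def dual_teacher_distillation_loss(
    student_emb: torch.Tensor,       # student text embeddings, shape (B, D)
    long_teacher_emb: torch.Tensor,  # long-text fine-tuned teacher embeddings, (B, D)
    clip_teacher_emb: torch.Tensor,  # frozen original-CLIP teacher embeddings, (B, D)
    alpha: float = 0.5,              # hypothetical weight balancing the two teachers
) -> torch.Tensor:
    """A minimal sketch: pull the normalized student embedding toward each
    teacher via cosine distance. The paper's actual loss may instead operate
    at the feature or logit level."""
    s = F.normalize(student_emb, dim=-1)
    t_long = F.normalize(long_teacher_emb, dim=-1)
    t_clip = F.normalize(clip_teacher_emb, dim=-1)

    # Cosine-distance distillation term for each teacher.
    loss_long = (1.0 - (s * t_long).sum(dim=-1)).mean()
    loss_clip = (1.0 - (s * t_clip).sum(dim=-1)).mean()

    # The frozen-CLIP term acts as a knowledge-retention regularizer.
    return alpha * loss_long + (1.0 - alpha) * loss_clip

if __name__ == "__main__":
    B, D = 8, 512  # batch size and CLIP text-embedding dimension
    student = torch.randn(B, D, requires_grad=True)
    loss = dual_teacher_distillation_loss(student, torch.randn(B, D), torch.randn(B, D))
    loss.backward()
    print(f"dual-teacher loss: {loss.item():.4f}")
```

Anchoring one term to the frozen original CLIP is what would discourage latent-space drift while the long-text teacher supplies the new representation knowledge, consistent with the retention behavior the abstract reports.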

Related Material


BibTeX:
@InProceedings{Feng_2025_CVPR,
    author    = {Feng, Yuheng and Wen, Changsong and Peng, Zelin and jiaye, Li and Zhu, Siyu},
    title     = {Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {24895-24904}
}