@InProceedings{Wang_2026_CVPR,
    author    = {Wang, Sijie and Zhu, Yingying},
    title     = {Task-Specific Knowledge Improves Generalization: A Logits-Based Framework for Continual Learning of Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {7615-7624}
}
Task-Specific Knowledge Improves Generalization: A Logits-Based Framework for Continual Learning of Vision-Language Models
Abstract
Continual learning of vision-language models faces a fundamental plasticity-stability dilemma. Existing methods typically treat task fine-tuning and zero-shot generalization as opposing forces, leading to limited adaptation to learned in-distribution (ID) tasks and an inability to surpass the zero-shot capacity of pretrained models on unseen out-of-distribution (OOD) tasks. To address this, we propose a novel logits-based framework with a dynamic architecture that, for the first time, unifies these two conflicting attributes at the logits level. During training, we jointly optimize learnable text prompts and the Parameter-Efficient Fine-Tuning (PEFT) modules integrated into the encoder. At inference, an improved Mahalanobis distance-based router separates ID samples from OOD samples. For OOD samples, a logits ensemble strategy selects the lowest-entropy logits and interpolates them with those from vanilla CLIP to mitigate overconfidence. Under the more challenging Cross-domain Task-Agnostic Incremental Learning (X-TAIL) setting, we further improve ID performance by selecting representative prompts and amplifying their logits. Experiments show substantial gains on both ID and OOD tasks under both the Multi-domain Task Incremental Learning (MTIL) and X-TAIL settings.
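The inference procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard (not the "improved") Mahalanobis distance, and the function names, the distance `threshold`, the interpolation weight `alpha`, and the assumption that every task-specific module yields logits over the same candidate classes as vanilla CLIP are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(logits):
    # Shannon entropy of the softmax distribution induced by the logits.
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def mahalanobis_sq(x, mean, cov_inv):
    # Squared Mahalanobis distance of a feature to one task's statistics.
    d = x - mean
    return float(d @ cov_inv @ d)

def route(feature, task_stats, threshold):
    # task_stats: list of (mean, inverse covariance) pairs, one per learned task.
    # Returns the index of the nearest task if it is close enough (ID),
    # otherwise None (treated as OOD). `threshold` is a hypothetical parameter.
    dists = [mahalanobis_sq(feature, m, ci) for m, ci in task_stats]
    best = int(np.argmin(dists))
    return best if dists[best] <= threshold else None

def ood_logits(task_logits, clip_logits, alpha=0.5):
    # Select the lowest-entropy task-specific logits and interpolate them
    # with vanilla CLIP logits. `alpha` is a hypothetical mixing weight.
    ents = [entropy(l) for l in task_logits]
    best = task_logits[int(np.argmin(ents))]
    return alpha * best + (1.0 - alpha) * clip_logits

# Toy usage with two learned tasks in a 2-D feature space.
stats = [(np.zeros(2), np.eye(2)), (np.full(2, 10.0), np.eye(2))]
id_task = route(np.array([0.1, 0.0]), stats, threshold=9.0)   # near task 0
ood_task = route(np.array([5.0, 5.0]), stats, threshold=9.0)  # far from both
mixed = ood_logits(
    [np.array([1.0, 1.0, 1.0]), np.array([5.0, 0.0, 0.0])],
    np.array([0.0, 2.0, 0.0]),
    alpha=0.5,
)
```

In the toy usage, the first feature is routed to task 0, the second is flagged as OOD, and the OOD logits come from the more confident (lower-entropy) task module, softened by the CLIP logits.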

