Audio-Visual Class-Incremental Learning

Pian, Weiguo; Mo, Shentong; Guo, Yunhui; Tian, Yapeng

Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7799-7811

Abstract

In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates a Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between the audio and visual modalities, and Visual Attention Distillation (VAD) to retain the previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI), based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
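As a rough illustration of the class-incremental scenario the abstract refers to, the sketch below partitions a label set into disjoint groups that arrive one step at a time, with evaluation covering every class seen so far. It is not taken from the AV-CIL code: the function names, the 12-class label set, and the 4-classes-per-step split are all assumptions chosen for illustration, and the actual dataset splits may differ.

```python
from typing import List, Sequence


def split_into_steps(class_ids: Sequence[int], classes_per_step: int) -> List[List[int]]:
    """Partition class ids into disjoint, ordered groups, one per incremental step."""
    if classes_per_step <= 0:
        raise ValueError("classes_per_step must be positive")
    ids = list(class_ids)
    return [ids[i:i + classes_per_step] for i in range(0, len(ids), classes_per_step)]


def seen_classes(steps: List[List[int]], step_index: int) -> List[int]:
    """Classes the model is evaluated on after finishing step `step_index` (0-based)."""
    return [c for group in steps[:step_index + 1] for c in group]


# Hypothetical setup (an assumption, not a dataset fact): 12 classes, 4 new per step.
steps = split_into_steps(range(12), classes_per_step=4)
for t, new_classes in enumerate(steps):
    evaluated = seen_classes(steps, t)
    print(f"step {t}: train on {new_classes}, evaluate on {len(evaluated)} classes")
```

Because each step trains only on the newly arrived classes while evaluation spans all classes seen so far, knowledge from earlier steps can be forgotten, which is the failure mode that D-AVSC and VAD are proposed to counter.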

Related Material


@InProceedings{Pian_2023_ICCV,
    author    = {Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
    title     = {Audio-Visual Class-Incremental Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {7799-7811}
}