LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang, Feng Cheng, Gedas Bertasius; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18462-18472

Abstract


Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. Solving ASD involves using audio and visual information in two complementary contexts: long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. Motivated by these observations, we propose LoCoNet, a simple but effective Long-Short Context Network that leverages Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM) in an interleaved manner. LIM employs self-attention to model long-range temporal dependencies and cross-attention to model audio-visual interactions. SIM incorporates convolutional blocks that capture local patterns for short-term inter-speaker context. Experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, with 95.2% (+0.3%) mAP on AVA-ActiveSpeaker, 97.2% (+2.7%) mAP on Talkies, and 68.4% (+7.7%) mAP on Ego4D. Moreover, in challenging cases where multiple speakers are present, LoCoNet outperforms previous state-of-the-art methods by 3.0% mAP on AVA-ActiveSpeaker. The code is available at https://github.com/SJTUwxz/LoCoNet_ASD.
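
The interleaving of LIM and SIM described above can be made concrete with a minimal PyTorch sketch. This is an illustrative interpretation rather than the authors' implementation (see the linked repository for that); all layer sizes, block counts, and module names below are assumptions chosen for clarity.

```python
# Minimal sketch of interleaved Long-term Intra-speaker Modeling (LIM) and
# Short-term Inter-speaker Modeling (SIM), assuming per-speaker audio/visual
# feature sequences of shape (batch * speakers, time, dim). Hyperparameters
# are illustrative, not the paper's.
import torch
import torch.nn as nn


class LIMBlock(nn.Module):
    """LIM: self-attention over time within one speaker, then cross-attention
    from that speaker's visual features to its audio features."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual, audio: (batch * speakers, time, dim)
        v = self.norm1(visual + self.self_attn(visual, visual, visual)[0])
        # visual queries attend to audio keys/values (audio-visual interaction)
        return self.norm2(v + self.cross_attn(v, audio, audio)[0])


class SIMBlock(nn.Module):
    """SIM: convolution over the (speaker, time) grid to capture short-term,
    local interactions among speakers in the same scene."""

    def __init__(self, dim=128, kernel=3):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=kernel, padding=kernel // 2)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x, speakers, time):
        # (batch * speakers, time, dim) -> (batch, dim, speakers, time)
        b = x.shape[0] // speakers
        x = x.view(b, speakers, time, -1).permute(0, 3, 1, 2)
        x = torch.relu(self.norm(self.conv(x)))
        return x.permute(0, 2, 3, 1).reshape(b * speakers, time, -1)


class LongShortContext(nn.Module):
    """Stacks LIM and SIM blocks in an interleaved manner."""

    def __init__(self, dim=128, num_blocks=3):
        super().__init__()
        self.lim = nn.ModuleList([LIMBlock(dim) for _ in range(num_blocks)])
        self.sim = nn.ModuleList([SIMBlock(dim) for _ in range(num_blocks)])

    def forward(self, visual, audio, speakers, time):
        for lim, sim in zip(self.lim, self.sim):
            visual = lim(visual, audio)           # long-term intra-speaker context
            visual = sim(visual, speakers, time)  # short-term inter-speaker context
        return visual


# Usage example with assumed shapes: 2 clips, 3 candidate speakers, 64 frames.
if __name__ == "__main__":
    model = LongShortContext(dim=128, num_blocks=3)
    visual = torch.randn(2 * 3, 64, 128)
    audio = torch.randn(2 * 3, 64, 128)
    out = model(visual, audio, speakers=3, time=64)
    print(out.shape)  # torch.Size([6, 64, 128])
```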

Related Material


@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Xizi and Cheng, Feng and Bertasius, Gedas},
    title     = {LoCoNet: Long-Short Context Network for Active Speaker Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18462-18472}
}