Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model

Israel D. Gebru, Sileye Ba, Georgios Evangelidis, Radu Horaud; Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 15-21

Abstract


Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals among the participants. Here we cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time. A probabilistic tracker exploits the on-image (spatial) coincidence of visual and auditory observations and infers a single latent variable which represents the identity of the active speaker. Both visual and auditory observations are explained by a recently proposed weighted-data mixture model, while several options for the speaking-turn dynamics are fulfilled by a multi-case transition model. The modules that translate raw audio and visual data into on-image observations are also described in detail. The performance of the proposed tracker is tested on challenging datasets that are available from recent contributions, which are used as baselines for comparison.
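
The abstract's core idea of inferring a single latent variable (the active speaker's identity) under a speaking-turn transition model can be illustrated with a generic discrete forward filter. This is only a minimal sketch: the uniform prior, the self-transition probability, and the per-frame audio-visual likelihoods below are illustrative placeholders, not the paper's weighted-data mixture model or its multi-case transition model.

```python
def forward_filter(likelihoods, transition, prior):
    """Recursively infer the posterior over the active-speaker identity.

    likelihoods: per-frame lists, likelihoods[t][k] ~ P(observation_t | speaker k)
    transition:  transition[j][k] ~ P(speaker k at t | speaker j at t-1)
    prior:       initial distribution over speakers
    """
    n = len(prior)
    belief = list(prior)
    for frame in likelihoods:
        # Predict: propagate the belief through the speaking-turn dynamics.
        predicted = [sum(transition[j][k] * belief[j] for j in range(n))
                     for k in range(n)]
        # Update: weight by the joint audio-visual likelihood, then normalize.
        belief = [p * l for p, l in zip(predicted, frame)]
        z = sum(belief)
        belief = [b / z for b in belief]
    return belief

# Two participants; the current speaker tends to keep the turn.
transition = [[0.9, 0.1], [0.1, 0.9]]
prior = [0.5, 0.5]
# Hypothetical audio-visual evidence: speaker 0 talks first, then speaker 1.
observations = [[0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
posterior = forward_filter(observations, transition, prior)
```

Note how the sticky transition matrix smooths the estimate: after the final frame, the posterior has shifted toward speaker 1 but retains mass on speaker 0 because the dynamics favor keeping the turn.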

Related Material


[pdf]
[bibtex]
@InProceedings{Gebru_2015_ICCV_Workshops,
author = {Gebru, Israel D. and Ba, Sileye and Evangelidis, Georgios and Horaud, Radu},
title = {Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {December},
year = {2015}
}