Image-sensitive language modeling for automatic speech recognition

Kata Naszadi, Youssef Oualil, Dietrich Klakow; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

Abstract


Language models in a speech recognizer typically use only the previous words as context and are therefore insensitive to context from the real world. This paper explores the benefits of introducing the visual modality as context information for automatic speech recognition. We use neural multimodal language models to rescore the recognition results of utterances that describe visual scenes, and we provide a comprehensive analysis of how much the language model improves when the image is added to the conditioning set. The image is introduced into a purely text-based RNN-LM using three different composition methods. Our experiments show that the visual modality can help recognition, yielding up to a 7.8% relative improvement, but can also hurt the results through overfitting to the visual input.
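The abstract describes rescoring ASR hypotheses with an RNN-LM conditioned on image features. The paper's actual three composition methods and network sizes are not given here; the following is a minimal NumPy sketch, under the assumption that one plausible composition is initializing the RNN hidden state from a projected image feature vector, and that rescoring linearly interpolates the acoustic score with the LM log-probability. All names, dimensions, and weights below are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, D = 20, 16, 8  # toy vocabulary, hidden, and image-feature sizes

# Toy parameters of a plain RNN language model.
E = rng.normal(0, 0.1, (V, H))   # word embeddings
W = rng.normal(0, 0.1, (H, H))   # recurrent weights
P = rng.normal(0, 0.1, (D, H))   # projects image features into hidden space
O = rng.normal(0, 0.1, (H, V))   # output projection

def log_probs(h):
    """Log-softmax over the vocabulary given hidden state h."""
    z = h @ O
    z -= z.max()                 # numerical stability
    return z - np.log(np.exp(z).sum())

def lm_score(words, image=None):
    """Sum of log P(w_t | w_<t, image); the image (if given) sets the initial state."""
    h = np.tanh(image @ P) if image is not None else np.zeros(H)
    total = 0.0
    for w in words:
        total += log_probs(h)[w]
        h = np.tanh(E[w] + h @ W)
    return total

# Rescore one ASR hypothesis: interpolate acoustic and LM scores.
image = rng.normal(size=D)       # hypothetical image feature vector
hyp = [3, 7, 1]                  # hypothetical word-id sequence
acoustic, lam = -12.5, 0.5       # hypothetical acoustic score and LM weight
combined = acoustic + lam * lm_score(hyp, image)
```

With untrained toy weights the scores are meaningless; the sketch only shows the mechanics of making the LM score depend on the image and feeding it into hypothesis rescoring.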

Related Material


[bibtex]
@InProceedings{Naszadi_2018_ECCV_Workshops,
author = {Naszadi, Kata and Oualil, Youssef and Klakow, Dietrich},
title = {Image-sensitive language modeling for automatic speech recognition},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}