Visually Indicated Sound Generation by Perceptually Optimized Classification

Kan Chen, Chuanxi Zhang, Chen Fang, Zhaowen Wang, Trung Bui, Ram Nevatia; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

Abstract


Visually indicated sound generation aims to predict sound that is consistent with the visual content of a video. Previous methods addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. Meanwhile, state-of-the-art sound classification networks are available to capture semantic-level information in the audio modality, which can also serve the task of visually indicated sound generation. In this paper, we explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, during training a perceptual loss is computed via a pre-trained sound classification network to align the semantic information of the generated sound with that of its ground truth. Experiments show that POCAN achieves significantly better results on the visually indicated sound generation task on two datasets.
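
The abstract describes a perceptual loss computed with a frozen, pre-trained sound classifier. The snippet below is a minimal sketch of that general idea, not the authors' implementation: a stand-in classifier embeds both the generated and the ground-truth audio, and the distance between the embeddings is added to an ordinary waveform regression loss. All module names, architectures, dimensions, and the 0.1 weighting are illustrative assumptions.

```python
# Sketch of a perceptual loss via a pre-trained sound classifier (assumed setup,
# not the POCAN code). Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoundClassifier(nn.Module):
    """Stand-in for a pre-trained sound classification network.
    In the paper's setting this network would be pre-trained and kept fixed."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        feat = self.features(wav)            # semantic-level embedding
        return feat, self.head(feat)         # embedding and class logits


def perceptual_loss(classifier, generated, target):
    """Distance between classifier embeddings of generated and real sound."""
    with torch.no_grad():
        feat_real, _ = classifier(target)    # ground truth: no gradient needed
    feat_gen, _ = classifier(generated)      # gradient flows back to generator
    return F.mse_loss(feat_gen, feat_real)


if __name__ == "__main__":
    clf = SoundClassifier().eval()
    for p in clf.parameters():               # classifier stays frozen
        p.requires_grad_(False)

    gen_wav = torch.randn(4, 1, 16000, requires_grad=True)  # generator output
    gt_wav = torch.randn(4, 1, 16000)                        # ground-truth sound

    # Combine plain waveform regression with the perceptual term.
    loss = F.mse_loss(gen_wav, gt_wav) + 0.1 * perceptual_loss(clf, gen_wav, gt_wav)
    loss.backward()
```

Because the classifier is frozen, only the generated audio (and hence the generator) receives gradients from the perceptual term, which pushes the generated sound toward the same semantic region as its ground truth rather than toward exact sample-level agreement.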

Related Material


[pdf]
[bibtex]
@InProceedings{Chen_2018_ECCV_Workshops,
author = {Chen, Kan and Zhang, Chuanxi and Fang, Chen and Wang, Zhaowen and Bui, Trung and Nevatia, Ram},
title = {Visually Indicated Sound Generation by Perceptually Optimized Classification},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}