Cascaded Siamese Self-Supervised Audio-to-Video GAN
Generating meaningful videos synchronised to audio is a complex synthesis task: the model must produce not only realistic frames but also coherent motion that conforms to the accompanying audio signal. Although considerable effort has been devoted to audio-to-video generative models, existing approaches rely heavily on supervised signals such as face/body key points or 3D meshes. Key-point annotation is time-consuming, and some dataset domains lack predictable structure, making the extraction of points of interest infeasible. Our proposed model is a cascaded generator-discriminator architecture that operates at the pixel level to generate videos conditioned on their soundtracks. It adopts a new self-supervised temporal augmentation technique that optimises the correlation between the audio signal and the generated video instead of relying on supervised signals. Extensive experiments comparing different models across two datasets demonstrate the effectiveness of the proposed architecture.
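To illustrate the general idea behind self-supervised temporal augmentation (not the paper's exact procedure, which the abstract does not specify), one common construction is to build positive pairs from aligned audio and video and negative pairs by misaligning the audio in time, so a discriminator can learn audio-video correlation without key-point labels. A minimal sketch, with all array shapes and the helper name `temporal_shift_pairs` being illustrative assumptions:

```python
import numpy as np

def temporal_shift_pairs(audio, video, max_shift, rng=None):
    """Build one aligned (positive) and one misaligned (negative) pair.

    audio: (T, A) per-frame audio features; video: (T, H, W, C) frames.
    The negative pair circularly shifts the audio in time, giving a
    synchronisation signal with no supervised annotation required.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Draw a nonzero shift so the negative pair is genuinely out of sync.
    shift = int(rng.integers(1, max_shift + 1))
    neg_audio = np.roll(audio, shift, axis=0)
    pos = (audio, video, 1)       # label 1: audio and video in sync
    neg = (neg_audio, video, 0)   # label 0: audio temporally shifted
    return pos, neg
```

A discriminator trained on such pairs is rewarded for detecting misalignment, which in turn pushes the generator toward motion that tracks the soundtrack.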