VMCML: Video and Music Matching via Cross-Modality Lifting

Yi-Shan Lee, Wei-Cheng Tseng, Fu-En Wang, Min Sun; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2060-2069

Abstract


We propose a content-based system for matching videos with background music. The system addresses the challenges of music recommendation for new users or new music, given short-form videos. To this end, we propose VMCML (Video and Music Matching via Cross-Modality Lifting), a cross-modal framework that finds a shared embedding space between video and music representations. To ensure the embedding space can be effectively shared by both representations, we leverage CosFace, a margin-based cosine similarity loss. Furthermore, to overcome the limitations of previous datasets, we collect videos and music from a well-known multimedia platform, ensuring that the music is not the original sound of each video and that more than one video is matched to the same music. We establish a large-scale dataset called MSV, which provides 390 individual music tracks and the corresponding 150,000 matched videos. We conduct extensive experiments on the YouTube-8M and our MSV datasets. Our quantitative and qualitative results demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art video and music matching performance.
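For reference, a margin-based cosine similarity (CosFace-style) objective over a shared video-music embedding space can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name, tensor shapes, and hyperparameter values (scale s, margin m) are assumptions, with s and m set to common CosFace defaults.

```python
import torch
import torch.nn.functional as F

def cosface_matching_loss(video_emb, music_emb, labels, s=30.0, m=0.35):
    """CosFace-style large-margin cosine loss for cross-modal matching.

    video_emb: (B, D) video embeddings from the video encoder
    music_emb: (C, D) one embedding per candidate music track
    labels:    (B,)   index of the ground-truth track for each video
    s: logit scale; m: cosine margin (illustrative defaults, not from the paper)
    """
    # L2-normalize so dot products are cosine similarities in the shared space.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(music_emb, dim=-1)
    logits = v @ t.t()  # (B, C) cosine similarity of every video to every track

    # Subtract the margin m from the ground-truth similarity only,
    # forcing matched pairs to be more similar than non-matches by at least m.
    margin = torch.zeros_like(logits)
    margin.scatter_(1, labels.unsqueeze(1), m)
    return F.cross_entropy(s * (logits - margin), labels)
```

Treating each music track as a class in this way is what lets retrieval reduce to a nearest-neighbor search over cosine similarity at inference time.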

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Lee_2024_CVPR,
    author    = {Lee, Yi-Shan and Tseng, Wei-Cheng and Wang, Fu-En and Sun, Min},
    title     = {VMCML: Video and Music Matching via Cross-Modality Lifting},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2060-2069}
}