@InProceedings{Li_2025_WACV,
    author    = {Li, Xiang and He, Yangfan and Zu, Shuaishuai and Li, Zhengyang and Shi, Tianyu and Xie, Yiting and Zhang, Kevin},
    title     = {Multi-Modal Large Language Model with RAG Strategies in Soccer Commentary Generation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6197-6206}
}
Multi-Modal Large Language Model with RAG Strategies in Soccer Commentary Generation
Abstract
As a globally celebrated sport, soccer has seen its appeal greatly amplified by engaging and vivid commentary. Recently, Multi-Modal Large Language Models (MLLMs) have attracted attention for generating soccer commentary because of their remarkable capacity to understand the different modalities of input videos. Most of these methods have shown that using multiple modalities, including video, audio, and structured metadata, can enhance commentary quality. However, delivering precise and rich commentary requires the ability to accurately discern subtle differences among similar backgrounds, events, and players, which presents a significant challenge for existing MLLMs. We therefore propose SoccerComment, a framework for generating soccer commentary that integrates MLLMs with Retrieval-Augmented Generation (RAG) strategies. Through a multi-modal clustering memory unit and retrieval-augmented in-context learning mechanisms, the framework enhances inference efficiency and reduces the need for continuous training, ultimately improving the accuracy and diversity of the commentary. By drawing on similar retrieved scenarios, SoccerComment demonstrates outstanding zero-shot performance, offering a new direction and a scalable solution for future research in soccer commentary generation.
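The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of retrieving from a clustered memory and building an in-context prompt. It is not the SoccerComment implementation. Every name here (`ClusteredMemory`, `build_prompt`), the two-dimensional toy embeddings, the fixed centroids, and the use of cosine similarity are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ClusteredMemory:
    """Hypothetical memory unit: entries are grouped by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids                 # assumed to be precomputed
        self.clusters = [[] for _ in centroids]    # one bucket per centroid

    def _nearest_cluster(self, embedding):
        return max(range(len(self.centroids)),
                   key=lambda i: cosine(embedding, self.centroids[i]))

    def add(self, embedding, commentary):
        self.clusters[self._nearest_cluster(embedding)].append((embedding, commentary))

    def retrieve(self, query, k=2):
        # Search only the closest cluster instead of the whole memory.
        members = self.clusters[self._nearest_cluster(query)]
        ranked = sorted(members, key=lambda m: cosine(query, m[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(examples, clip_description):
    """Assemble an in-context prompt from retrieved commentaries (illustrative format)."""
    lines = ["Example commentaries for similar scenes:"]
    lines += [f"- {example}" for example in examples]
    lines.append(f"Now comment on this scene: {clip_description}")
    return "\n".join(lines)


# Toy usage: embeddings stand in for features from a multi-modal encoder.
memory = ClusteredMemory(centroids=[[1.0, 0.0], [0.0, 1.0]])
memory.add([0.9, 0.1], "A curling shot into the top corner!")
memory.add([0.8, 0.3], "He slots it past the keeper.")
memory.add([0.1, 0.9], "The referee shows a yellow card.")

examples = memory.retrieve([0.95, 0.05], k=1)
prompt = build_prompt(examples, "striker shoots from the edge of the box")
```

Restricting the search to one cluster is one plausible way a clustered memory could reduce retrieval cost at inference time, and supplying retrieved examples in the prompt avoids retraining the model when new scenarios are added to the memory.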
Related Material