Rugby Scene Classification Enhanced by Vision Language Model

Naoki Nonaka, Ryo Fujihira, Toshiki Koshiba, Akira Maeda, Jun Seita; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 3256-3266

Abstract


This study investigates the integration of vision language models (VLMs) to enhance the classification of situations within rugby match broadcasts. Accurately identifying situations in sports videos is important for understanding game dynamics and for facilitating downstream tasks such as performance evaluation and injury prevention. Using a dataset of 18,000 labeled images extracted at 0.2-second intervals from 100 minutes of rugby match broadcasts, we performed scene classification tasks covering contact plays (scrums, mauls, rucks, tackles, lineouts), rucks, tackles, lineouts, and multiclass classification. The study aims to validate the utility of VLM outputs in improving classification performance compared with using image data alone. Experimental results demonstrate substantial performance improvements across all tasks with the incorporation of VLM outputs. Our analysis of prompts suggests that, when provided with appropriate contextual information through natural language, VLMs can effectively capture the context of a given image. The findings of our study indicate that leveraging VLMs in the domain of sports analysis holds promise for developing image processing models capable of incorporating the tacit knowledge encoded within language models as well as information conveyed through natural language descriptions.
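The task list in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each frame carries a single multiclass label from which the binary task targets are derived; the label strings, the `open_play` placeholder, and the `binary_targets` helper are illustrative assumptions and may not match the labeling scheme actually used in the paper.

```python
from typing import Dict

# Contact-play categories as listed in the abstract.
CONTACT_PLAYS = ("scrum", "maul", "ruck", "tackle", "lineout")


def binary_targets(label: str) -> Dict[str, int]:
    """Map one multiclass frame label to targets for the binary tasks.

    Hypothetical helper: assumes one label per frame.
    """
    return {
        "contact_play": int(label in CONTACT_PLAYS),
        "ruck": int(label == "ruck"),
        "tackle": int(label == "tackle"),
        "lineout": int(label == "lineout"),
    }


# Hypothetical frame labels; "open_play" stands in for any non-contact frame.
frame_labels = ["ruck", "open_play", "scrum"]
targets = [binary_targets(lbl) for lbl in frame_labels]
```

Under this assumption a single annotation pass would serve all five tasks: the multiclass task uses the label directly, while each binary task uses the corresponding derived target.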

Related Material


@InProceedings{Nonaka_2024_CVPR,
    author    = {Nonaka, Naoki and Fujihira, Ryo and Koshiba, Toshiki and Maeda, Akira and Seita, Jun},
    title     = {Rugby Scene Classification Enhanced by Vision Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {3256-3266}
}