Generalist YOLO: Towards Real-Time End-to-End Multi-Task Visual Language Models

Hung-Shuo Chang, Chien-Yao Wang, Richard Robert Wang, Gene Chou, Hong-Yuan Mark Liao; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6217-6227

Abstract


Generalist models capable of handling multiple modalities and tasks simultaneously are currently one of the hottest research topics. However, due to interference between different tasks during the training process, existing generalist models require a very large decoder to achieve good results across tasks, which makes real-time prediction difficult. This paper introduces Generalist YOLO, which takes a significant step towards real-time prediction systems for visual language generalist models. The proposed Generalist YOLO uses a unified encoder to reduce conflicts between different tasks, thereby decreasing the complexity required of the decoder. It also introduces a primary-secondary co-attention mechanism that allows different tasks to learn together more effectively, achieving both high efficiency and high accuracy. We further propose a semantically consistent asymmetric training strategy that allows the various tasks to benefit from performance improvements brought by the latest research results in their respective fields. The proposed Generalist YOLO achieves excellent results on a variety of vision and language tasks based on MS COCO. While maintaining high accuracy across all tasks, it is 135 times faster than existing generalist models. The source code is released on GitHub at https://github.com/WongKinYiu/GeneralistYOLO.

Related Material


@InProceedings{Chang_2025_WACV,
  author    = {Chang, Hung-Shuo and Wang, Chien-Yao and Wang, Richard Robert and Chou, Gene and Liao, Hong-Yuan Mark},
  title     = {Generalist YOLO: Towards Real-Time End-to-End Multi-Task Visual Language Models},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {6217-6227}
}