LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Abstract
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared to previous studies on vision-language representation learning with ASR, our method naturally fits the streaming characteristics of ASR, enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data pipeline for YouTube videos and their closed captions (CC), resulting in the Live-CC-10M pre-training set and the Live-WhisperX-408K high-quality supervised fine-tuning (SFT) set. Remarkably, even without SFT, the pre-trained model LiveCC-7B demonstrates significant improvements in general video QA and exhibits a new capability for real-time video commentary. To evaluate this, we carefully design a new benchmark, LiveSports-3K, which uses LLM-as-a-judge to measure free-form commentary. Experiments show that our final LiveCC-7B model can surpass LLaVA-Video-72B in commentary quality even while working in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B scale on popular benchmarks such as VideoMME, demonstrating its broad generalizability. All resources of this paper have been released at showlab.github.io/livecc.
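The core idea described above, densely interleaving ASR words and video frames by their timestamps, can be illustrated with a minimal sketch. This is not the authors' released implementation; the fixed frame sampling rate, the (word, start_time) record layout, and the <frame> placeholder token are illustrative assumptions.

```python
# Minimal sketch (assumed, not the released LiveCC code): interleave timestamped
# ASR words with sampled video frames to form a streaming training sequence.
from typing import List, Tuple

def interleave_asr_frames(
    frame_times: List[float],            # timestamp (seconds) of each sampled frame
    asr_words: List[Tuple[str, float]],  # (word, start_time_seconds) pairs, sorted by time
    frame_token: str = "<frame>",        # hypothetical placeholder for visual tokens
) -> List[str]:
    """Emit each frame placeholder followed by the ASR words spoken before the next frame."""
    sequence: List[str] = []
    word_idx = 0
    for i, _t in enumerate(frame_times):
        # Upper bound of this frame's interval: the next frame time (or infinity for the last frame).
        next_t = frame_times[i + 1] if i + 1 < len(frame_times) else float("inf")
        sequence.append(frame_token)
        # Append every not-yet-emitted word whose start time precedes the next frame.
        while word_idx < len(asr_words) and asr_words[word_idx][1] < next_t:
            sequence.append(asr_words[word_idx][0])
            word_idx += 1
    return sequence

# Example: frames sampled at 2 FPS and a few transcript words.
frames = [0.0, 0.5, 1.0, 1.5]
words = [("the", 0.1), ("player", 0.4), ("shoots", 0.9), ("and", 1.2), ("scores", 1.6)]
print(interleave_asr_frames(frames, words))
# ['<frame>', 'the', 'player', '<frame>', 'shoots', '<frame>', 'and', '<frame>', 'scores']
```

Because every word is placed immediately after the last frame shown before it was spoken, the resulting sequence preserves the streaming, temporally aligned supervision that the abstract attributes to ASR transcripts.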
Related Material
[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Chen_2025_CVPR,
    author    = {Chen, Joya and Zeng, Ziyun and Lin, Yiqi and Li, Wei and Ma, Zejun and Shou, Mike Zheng},
    title     = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29083-29095}
}