ViTA: An Efficient Video-to-Text Algorithm  using VLM for RAG-based Video Analysis System

Arefeen, Md Adnan; Debnath, Biplob; Uddin, Md Yusuf Sarwar; Chakradhar, Srimat

ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System

Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2266-2274

Abstract

Retrieval-augmented generation (RAG) is used in natural language processing (NLP) to provide query-relevant information in enterprise documents to large language models (LLMs). Such enterprise context enables the LLMs to generate more informed and accurate responses. When enterprise data is primarily videos AI models like vision language models (VLMs) are necessary to convert information in videos into text. While essential this conversion is a bottleneck especially for large corpus of videos. It delays the timely use of enterprise videos to generate useful responses. We propose ViTA a novel method that leverages two unique characteristics of VLMs to expedite the conversion process. As VLMs output more text tokens they incur higher latency. In addition large (heavyweight) VLMs can extract intricate details from images and videos but they incur much higher latency per output token when compared to smaller (lightweight) VLMs that may miss details. To expedite conversion ViTA first employs a lightweight VLM to quickly understand the gist or overview of an image or a video clip and directs a heavyweight VLM (through prompt engineering) to extract additional details by using only a few (preset number of) output tokens. Our experimental results show that ViTA expedites the conversion time by as much as 43% without compromising the accuracy of responses when compared to a baseline system that only uses a heavyweight VLM.

Related Material

[pdf]

[bibtex]

@InProceedings{Arefeen_2024_CVPR, author = {Arefeen, Md Adnan and Debnath, Biplob and Uddin, Md Yusuf Sarwar and Chakradhar, Srimat}, title = {ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, month = {June}, year = {2024}, pages = {2266-2274} }