CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario

Zhizhao Duan, Hao Cheng, Duo Xu, Xi Wu, Xiangxie Zhang, Xi Ye, Zhen Xie; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7180-7189

Abstract


In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy by (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering, during both training and testing phases; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) advancing prediction accuracy via a unique sequential questioning-based prediction augmentation. Demonstrating top-tier performance, our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code will be released soon.
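As a rough illustration of step (1), the sketch below picks a "best view" among several camera views using bounding boxes. The abstract does not state the selection criterion, so the rule used here (keep the view in which the annotated box covers the largest area) is an assumption, and the names `Box` and `select_best_view` are hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels; (x, y) is the top-left corner."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        # Treat degenerate boxes (negative width or height) as empty.
        return max(self.w, 0.0) * max(self.h, 0.0)


def select_best_view(boxes_by_view: Dict[str, Optional[Box]]) -> Optional[str]:
    """Return the view whose box has the largest area.

    ASSUMPTION: "largest box area" stands in for the paper's actual
    best-view criterion, which the abstract does not specify.
    Views without a box (None) are skipped; returns None if no view has one.
    """
    candidates = {view: box for view, box in boxes_by_view.items() if box is not None}
    if not candidates:
        return None
    return max(candidates, key=lambda view: candidates[view].area)


# Hypothetical example: three camera views of the same event.
views = {
    "cam_a": Box(0, 0, 10, 20),   # area 200
    "cam_b": Box(5, 5, 40, 30),   # area 1200
    "cam_c": None,                # target not annotated in this view
}
best = select_best_view(views)
```

The same box could then drive the visual prompt, for example by cropping to it or drawing it on the frame before the frame is passed to the VLM; which variant CityLLaVA uses is not stated in the abstract.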

Related Material


@InProceedings{Duan_2024_CVPR,
    author    = {Duan, Zhizhao and Cheng, Hao and Xu, Duo and Wu, Xi and Zhang, Xiangxie and Ye, Xi and Xie, Zhen},
    title     = {CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7180-7189}
}