Semantically Enhanced Scene Captions with Physical and Weather Condition Changes

Hidetomo Sakaino; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 3654-3666

Abstract

Vision-Language Models (VLMs) trained on image-text pairs, e.g., CLIP, have boosted image-based Deep Learning (DL). Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. Images from surveillance, autonomous-driving, and mobile-phone cameras have been used with segmentation and captioning. However, unlike indoor scenes, outdoor scenes with uncontrolled illumination and noise can degrade the accuracy of segmented objects. Moreover, unpredictable events such as natural phenomena and accidents can cause dynamic and adverse scene changes over time, greatly increasing the number of unseen objects. Therefore, a single state-of-the-art (SOTA) VLM or DL model alone cannot sufficiently generate and enhance captions, and even a single round of VQA is limited in producing a good answer. This paper proposes RoadCAP, which refines and enriches qualitative and quantitative captions by combining DL models and VLMs trained on different tasks in a complementary manner. In particular, 2D-Contrastive Physical-Scale Pretraining (CPP) is proposed to produce captions with physical scales, and an iterative VQA model is proposed to refine incompletely segmented images through prompting. Experimental results show that the proposed method outperforms SOTA DL models and VLMs on images with adverse conditions, and its captions reach a higher semantic level for real-world scene descriptions than those of SOTA VLMs.
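The abstract does not give implementation details, but the iterative VQA idea, repeatedly prompting a VQA model to fill in what segmentation left unresolved under adverse conditions, can be illustrated as a simple refinement loop. The sketch below is a non-authoritative reading of that idea; segment(), vqa_answer(), and compose_caption() are hypothetical stand-ins for real segmentation, VQA, and captioning models, not the paper's API.

    # Minimal sketch of an iterative VQA refinement loop in the spirit of RoadCAP.
    # All model wrappers are hypothetical placeholders with dummy outputs.

    from typing import Dict, List

    def segment(image) -> Dict[str, float]:
        """Hypothetical open-vocabulary segmenter: label -> pixel coverage ratio."""
        return {"road": 0.45, "sky": 0.30, "unknown": 0.25}  # dummy output

    def vqa_answer(image, question: str) -> str:
        """Hypothetical VQA model: answers a free-form question about the image."""
        return "wet road surface under light rain"  # dummy output

    def compose_caption(labels: Dict[str, float], facts: List[str]) -> str:
        """Merge segmentation labels and VQA-derived facts into one caption."""
        objects = ", ".join(k for k in labels if k != "unknown")
        return f"Scene contains {objects}. " + " ".join(facts)

    def iterative_vqa_caption(image, max_rounds: int = 3,
                              unknown_thresh: float = 0.05) -> str:
        labels = segment(image)
        facts: List[str] = []
        for _ in range(max_rounds):
            # Stop once little of the image remains unexplained.
            if labels.get("unknown", 0.0) <= unknown_thresh:
                break
            # Ask a targeted prompt about the unresolved region, then fold
            # the answer back into the caption and re-check coverage.
            answer = vqa_answer(image, "What is in the unlabeled region, "
                                       "and what are the weather conditions?")
            facts.append(answer)
            labels["unknown"] = max(0.0, labels["unknown"] - 0.1)  # toy update
        return compose_caption(labels, facts)

    if __name__ == "__main__":
        print(iterative_vqa_caption(image=None))

Similarly, CPP's exact objective is not spelled out in the abstract; one plausible reading, given the CLIP reference, is a CLIP-style symmetric contrastive loss over image-text pairs whose texts carry physical scales (e.g., a visibility distance). The function below is such a sketch under that assumption, not the paper's loss.

    import torch
    import torch.nn.functional as F

    def contrastive_physical_scale_loss(img_emb: torch.Tensor,
                                        txt_emb: torch.Tensor,
                                        temperature: float = 0.07) -> torch.Tensor:
        # img_emb, txt_emb: (B, D) L2-normalized embeddings; the i-th text is
        # assumed to be a caption augmented with a physical scale phrase.
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        # Symmetric InfoNCE over matched pairs, as in CLIP-style pretraining.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        B, D = 4, 8
        img = F.normalize(torch.randn(B, D), dim=-1)
        txt = F.normalize(torch.randn(B, D), dim=-1)
        print(contrastive_physical_scale_loss(img, txt).item())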

Related Material

[pdf] [bibtex]
@InProceedings{Sakaino_2023_ICCV,
  author    = {Sakaino, Hidetomo},
  title     = {Semantically Enhanced Scene Captions with Physical and Weather Condition Changes},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2023},
  pages     = {3654-3666}
}