Unseen And Adverse Outdoor Scenes Recognition Through Event-Based Captions

Hidetomo Sakaino; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 3594-3601

Abstract


This paper presents EventCAP, an event-based captioning framework that produces refined and enriched qualitative and quantitative captions by combining Deep Learning (DL) models and Vision Language Models (VLMs) trained for different tasks in a complementary manner. Indoor and outdoor images are commonly used for object recognition and captioning. Outdoor scenes, however, vary widely due to natural phenomena such as weather changes; the resulting changes in illumination and object shape can degrade segmentation and increase the number of unseen objects and scenes under adverse conditions. Meanwhile, a single state-of-the-art (SOTA) DL or VLM typically handles only a single or limited set of tasks. Therefore, this paper proposes EventCAP, which generates captions that include physical scales and objects' surface properties. Moreover, an iterative VQA model is proposed to refine incompletely segmented images using prompts. Experiments show that the resulting captions reach a higher semantic level for real-world scene descriptions than those of SOTA VLMs.
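
The sketch below is only an illustration of the kind of iterative VQA refinement loop the abstract describes; it is not the paper's implementation. The function names (`segment_image`, `ask_vqa`), the prompt wording, and the stopping criterion are all placeholder assumptions.

```python
# Hypothetical sketch: prompt a VQA model about weakly segmented regions and
# accumulate event-style caption fragments (weather, surface state).
from typing import Dict, List


def segment_image(image_path: str) -> Dict[str, float]:
    """Placeholder segmenter: returns per-class coverage ratios (stubbed values)."""
    return {"road": 0.42, "sky": 0.30, "building": 0.08}


def ask_vqa(image_path: str, prompt: str) -> str:
    """Placeholder VQA call: a real system would query a vision-language model."""
    return "yes" if "snow" in prompt else "no"


def refine_captions(image_path: str, max_iters: int = 3) -> List[str]:
    """Iteratively ask about the most weakly segmented class and collect answers."""
    captions: List[str] = []
    coverage = segment_image(image_path)
    for _ in range(max_iters):
        weakest = min(coverage, key=coverage.get)
        prompt = f"Is the {weakest} region covered by snow, rain, or fog?"
        captions.append(f"{weakest}: {ask_vqa(image_path, prompt)}")
        if coverage[weakest] > 0.5:   # stub stopping criterion
            break
        coverage[weakest] += 0.2      # pretend the refinement improved coverage
    return captions


if __name__ == "__main__":
    print(refine_captions("example_outdoor_scene.jpg"))
```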

Related Material


[pdf]
[bibtex]
@InProceedings{Sakaino_2023_ICCV,
    author    = {Sakaino, Hidetomo},
    title     = {Unseen And Adverse Outdoor Scenes Recognition Through Event-Based Captions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {3594-3601}
}