@InProceedings{Uziel_2025_WACV,
  author    = {Uziel, Roy and Bialer, Oded},
  title     = {Optimizing Vision-Language Model for Road Crossing Intention Estimation},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {1702-1712}
}
Optimizing Vision-Language Model for Road Crossing Intention Estimation
Abstract
Identifying a pedestrian's intention to cross the road is crucial for autonomous driving, as it alerts the system to stop or slow down. However, determining crossing intention from video is challenging due to the need to extract complex high-level semantics. This paper introduces ClipCross, a novel classification framework optimized to extract high-level semantic features using the vision-language model CLIP for determining crossing intention. Existing CLIP-based methods perform poorly on this task because CLIP's image and text encoders fail to capture the nuanced semantic distinctions between crossing and non-crossing intention images. ClipCross addresses this by optimizing a set of CLIP text embeddings to extract high-level semantic features, which a multi-layer perceptron then uses to distinguish between crossing and non-crossing intentions. ClipCross achieves state-of-the-art performance on the crossing intention estimation benchmark datasets PIE, PSI, and JAAD.
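To illustrate the pipeline the abstract describes, below is a minimal NumPy sketch: cosine similarities between a frozen image embedding and a set of learnable text embeddings form a semantic feature vector, which a small MLP head maps to crossing / not-crossing logits. All dimensions, weight initializations, and the stand-in image embedding are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper).
EMBED_DIM, NUM_PROMPTS, HIDDEN = 512, 8, 64

def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-in for a frozen CLIP image embedding of one video frame.
image_emb = l2_normalize(rng.standard_normal(EMBED_DIM))

# Learnable text embeddings (randomly initialized here; in training these
# would be optimized by gradient descent while CLIP itself stays frozen).
text_embs = l2_normalize(rng.standard_normal((NUM_PROMPTS, EMBED_DIM)))

# Cosine similarities between the image and each text embedding act as the
# high-level semantic features fed to the classifier.
features = text_embs @ image_emb  # shape: (NUM_PROMPTS,)

# A small MLP head maps the features to two class logits.
W1 = rng.standard_normal((NUM_PROMPTS, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, 2)) * 0.1
b2 = np.zeros(2)

hidden = np.maximum(features @ W1 + b1, 0.0)  # ReLU
logits = hidden @ W2 + b2
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()  # softmax over crossing / not-crossing
```

In the actual method the text embeddings and MLP weights would be trained end-to-end on labeled crossing-intention clips; this sketch only shows the forward pass.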