CrossPAR: Enhancing Pedestrian Attribute Recognition with Vision-Language Fusion and Human-Centric Pre-training

Bach-Hoang Ngo, Si-Tri Ngo, Phu-Duc Le, Quang-Minh Phan, Minh-Triet Tran, Trung-Nghia Le; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 1301-1315

Abstract


Pedestrian attribute recognition (PAR) is crucial in applications such as surveillance and urban planning, yet accurately identifying attributes in diverse and intricate urban environments remains challenging. This paper introduces CrossPAR, a novel network for PAR that integrates a human-centric encoder, pre-trained on extensive human datasets, with a vision-language encoder, pre-trained on large text-image pair datasets. We also develop a cross-attention mechanism based on a Mixture-of-Experts approach that combines the human-centric encoder's proficiency in local attribute detection with the vision-language encoder's ability to comprehend global content. CrossPAR achieves accuracy comparable to existing techniques across multiple benchmarks while using less training data. These results confirm the effectiveness of our approach and suggest promising avenues for further research and practical applications in PAR and related fields.
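
The abstract does not include implementation details. The following is a minimal sketch, assuming PyTorch, of how a cross-attention fusion between a human-centric encoder stream and a vision-language encoder stream, gated by a small Mixture-of-Experts head, might be structured; all module names, dimensions, and the attribute head are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (not the authors' code): cross-attention fusion of a
    # human-centric encoder stream with a vision-language encoder stream,
    # combined through a Mixture-of-Experts gate. All names, dimensions,
    # and the attribute count are illustrative assumptions.
    import torch
    import torch.nn as nn


    class CrossAttentionMoEFusion(nn.Module):
        def __init__(self, dim=768, num_heads=8, num_experts=4, num_attributes=35):
            super().__init__()
            # Cross-attention: human-centric tokens query the vision-language tokens.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            # Each expert is a small feed-forward network over the fused tokens.
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                 for _ in range(num_experts)]
            )
            # The gate produces per-token mixture weights over the experts.
            self.gate = nn.Linear(dim, num_experts)
            # Multi-label attribute head (one logit per pedestrian attribute).
            self.classifier = nn.Linear(dim, num_attributes)

        def forward(self, human_tokens, vl_tokens):
            # human_tokens: (B, N_h, dim) from the human-centric encoder
            # vl_tokens:    (B, N_v, dim) from the vision-language encoder
            fused, _ = self.cross_attn(query=human_tokens, key=vl_tokens, value=vl_tokens)
            fused = self.norm(fused + human_tokens)

            # Mixture-of-Experts: softmax-weighted sum of expert outputs per token.
            weights = torch.softmax(self.gate(fused), dim=-1)                    # (B, N_h, E)
            expert_out = torch.stack([e(fused) for e in self.experts], dim=-1)   # (B, N_h, dim, E)
            fused = (expert_out * weights.unsqueeze(2)).sum(dim=-1)              # (B, N_h, dim)

            # Pool tokens and predict attribute logits (multi-label).
            return self.classifier(fused.mean(dim=1))


    # Example usage with random features standing in for encoder outputs.
    if __name__ == "__main__":
        model = CrossAttentionMoEFusion()
        human_feats = torch.randn(2, 196, 768)  # e.g. patch tokens from a human-centric ViT
        vl_feats = torch.randn(2, 197, 768)     # e.g. tokens from a CLIP-style image encoder
        logits = model(human_feats, vl_feats)
        print(logits.shape)                     # torch.Size([2, 35])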

Related Material


@InProceedings{Ngo_2024_ACCV,
    author    = {Ngo, Bach-Hoang and Ngo, Si-Tri and Le, Phu-Duc and Phan, Quang-Minh and Tran, Minh-Triet and Le, Trung-Nghia},
    title     = {CrossPAR: Enhancing Pedestrian Attribute Recognition with Vision-Language Fusion and Human-Centric Pre-training},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {1301-1315}
}