HumanFormer: Human-centric Prompting Multi-modal Perception Transformer for Referring Crowd Detection

Qiu, Heqian; Wang, Lanxiao; Zhao, Taijin; Meng, Fanman; Li, Hongliang

Heqian Qiu, Lanxiao Wang, Taijin Zhao, Fanman Meng, Hongliang Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5530-5540

Abstract

As an important step towards crowd understanding, referring crowd detection (RCD) aims to locate the person described by a natural language expression in crowded human environments. Existing methods rely on either ambiguous object-based or token-based features for general scene understanding. However, both ignore the diverse fine-grained human properties and complex relationships that are crucial for locating the target person among similar persons. In this paper, we propose a novel human-centric prompting multi-modal perception transformer (HumanFormer) to explicitly align fine-grained human concept information between visual and language modalities for accurate referring crowd detection. Specifically, we introduce a human-centric prompt exporter to adaptively exploit various human-related part and attribute prompt representations with prior knowledge. Based on part-level prompts, we then design a part-prompting multi-modal encoder that finely achieves cross-modal focusing fusion within each part region to avoid interference from irrelevant regions. Furthermore, we leverage an attribute-prompting reasoning decoder to gradually infer the final object location according to its interactive relationships with the fine-grained attribute representation, language, and vision, sequentially. Extensive experimental results on the challenging RefCrowd, other general benchmarks, and the JRDB dataset demonstrate the effectiveness and generality of the proposed method.

Related Material

[bibtex]

@InProceedings{Qiu_2024_CVPR,
    author    = {Qiu, Heqian and Wang, Lanxiao and Zhao, Taijin and Meng, Fanman and Li, Hongliang},
    title     = {HumanFormer: Human-centric Prompting Multi-modal Perception Transformer for Referring Crowd Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {5530-5540}
}