Spatial Group-wise Enhance: Enhancing Semantic Feature Learning in CNN

Yuxuan Li, Xiang Li, Jian Yang; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 687-702


The success of attention modules in CNNs has attracted increasing and widespread attention over the past years. However, most existing attention modules fail to consider two important factors: (1) for images, different semantic entities are located in different areas, so they should be associated with different spatial attention masks; (2) most existing frameworks exploit individual local or global information to guide the generation of attention masks, ignoring the joint information of local-global similarities, which can be more effective. To explore these two ingredients, we propose the Spatial Group-wise Enhance (SGE) module. SGE explicitly distributes different but accurate spatial attention masks across various semantics, guided by the local-global similarities inside each individual semantic feature group. Furthermore, SGE is lightweight, with almost no extra parameters and computation. Despite being trained with only category supervision, SGE is effective in highlighting multiple active areas with various high-level semantics (such as a dog's eyes, nose, etc.). When integrated with popular CNN backbones, SGE can significantly boost their performance on image recognition tasks. Specifically, with ResNet101 backbones, SGE improves the baseline by 0.7% Top-1 accuracy on ImageNet classification and by 1.6%-1.8% AP on COCO detection tasks. The code and pretrained models are available at
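To make the mechanism concrete, the abstract's pipeline — split channels into semantic groups, measure the local-global similarity inside each group, normalize it spatially, and use it as a sigmoid attention mask — can be sketched as follows. This is a minimal NumPy reconstruction, not the authors' released implementation; the per-group scale/shift parameters (here `gamma`, `beta`, zero-initialized) are learnable in the actual module, and all names are ours.

```python
import numpy as np

def sge(x, groups=8, gamma=None, beta=None, eps=1e-5):
    """Sketch of Spatial Group-wise Enhance (SGE).

    x: feature map of shape (B, C, H, W); C must be divisible by `groups`.
    gamma, beta: per-group scale and shift of the normalized similarity
    map (learnable parameters in the paper; zeros here as a stand-in).
    """
    b, c, h, w = x.shape
    if gamma is None:
        gamma = np.zeros(groups)
    if beta is None:
        beta = np.zeros(groups)
    xg = x.reshape(b * groups, c // groups, h, w)
    # Global semantic vector per group: spatial average pooling.
    g = xg.mean(axis=(2, 3), keepdims=True)
    # Local-global similarity at every position: channel-wise dot product
    # between each local feature and the group's global vector.
    t = (xg * g).sum(axis=1, keepdims=True)        # (B*G, 1, H, W)
    # Normalize the similarity map over spatial positions.
    mu = t.mean(axis=(2, 3), keepdims=True)
    std = t.std(axis=(2, 3), keepdims=True)
    t = (t - mu) / (std + eps)
    # Per-group scale and shift, then sigmoid gating.
    gm = np.tile(gamma, b).reshape(b * groups, 1, 1, 1)
    bt = np.tile(beta, b).reshape(b * groups, 1, 1, 1)
    a = 1.0 / (1.0 + np.exp(-(gm * t + bt)))       # sigmoid attention mask
    return (xg * a).reshape(b, c, h, w)
```

With `gamma` and `beta` at zero, the mask is 0.5 everywhere (sigmoid of zero), so the module starts as a uniform scaling of its input and learns to sharpen the per-group spatial masks during training; note that the only parameters are the 2·G scalars, which matches the "almost no extra parameters" claim.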

Related Material

@InProceedings{Li_2022_ACCV,
  author    = {Li, Yuxuan and Li, Xiang and Yang, Jian},
  title     = {Spatial Group-wise Enhance: Enhancing Semantic Feature Learning in CNN},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2022},
  pages     = {687-702}
}