gScoreCAM: What objects is CLIP looking at?

Peijie Chen, Qi Li, Saad Biaz, Trung Bui, Anh Nguyen; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 1959-1975


Large-scale multimodal models trained on web data, such as OpenAI's CLIP, are becoming the foundation of many applications. Yet they are also harder to understand, test, and, therefore, align with human values. In this paper, we propose gScoreCAM--a state-of-the-art method for visualizing the main objects that CLIP is looking at in an image. On zero-shot object detection, gScoreCAM performs comparably to ScoreCAM, the best prior method on CLIP, while running 8 to 10 times faster. Our method outperforms other well-known methods (HilaCAM, RISE, and the entire CAM family) by a large margin, especially in multi-object scenes. gScoreCAM sub-samples the k = 300 channels (out of 3,072--i.e. reducing complexity by almost 10 times) with the highest gradients and linearly combines them into a final "attention" visualization. We demonstrate the utility and superiority of our method on three datasets: ImageNet, COCO, and PartImageNet. Our work opens up interesting future directions in understanding and de-biasing CLIP.
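The channel sub-sampling idea in the abstract--keep only the k highest-gradient channels, score each one, and linearly combine them--can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: the `score_fn` callable (standing in for CLIP's image-text similarity on the masked image), the `norm` helper, and all array shapes are assumptions for demonstration.

```python
import numpy as np

def norm(a):
    """Min-max normalize a 2-D activation map to [0, 1]."""
    rng = a.max() - a.min()
    return (a - a.min()) / rng if rng > 0 else np.zeros_like(a)

def gscorecam_sketch(activations, gradients, score_fn, k=3):
    """Hedged sketch of gScoreCAM's channel sub-sampling.

    activations: (C, H, W) feature maps from a chosen CNN layer.
    gradients:   (C,) per-channel gradient magnitudes of the CLIP
                 similarity score w.r.t. each channel (assumed
                 precomputed; the paper uses k = 300 of 3,072).
    score_fn:    callable mapping a normalized (H, W) map to a
                 scalar weight -- in the paper, CLIP's similarity
                 on the masked input image (mocked here).
    """
    # 1. Keep only the k channels with the largest gradient magnitude.
    top = np.argsort(-np.abs(gradients))[:k]
    # 2. Score each selected channel's normalized activation map.
    weights = np.array([score_fn(norm(activations[c])) for c in top])
    # 3. Linearly combine the selected maps, then clip negatives (ReLU).
    cam = np.maximum(0.0, np.tensordot(weights, activations[top], axes=1))
    return cam
```

Scoring only k channels instead of all C is what makes gScoreCAM 8 to 10 times faster than ScoreCAM, which must run one forward pass per channel.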

Related Material

[pdf] [supp] [code]
@InProceedings{Chen_2022_ACCV,
    author    = {Chen, Peijie and Li, Qi and Biaz, Saad and Bui, Trung and Nguyen, Anh},
    title     = {gScoreCAM: What objects is CLIP looking at?},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {1959-1975}
}