@InProceedings{Hirota_2025_ICCV,
  author    = {Hirota, Yusuke and Hachiuma, Ryo and Li, Boyi and Lu, Ximing and Boone, Michael Ross and Ivanovic, Boris and Choi, Yejin and Pavone, Marco and Wang, Yu-Chiang Frank and Garcia, Noa and Nakashima, Yuta and Yang, Chao-Han Huck},
  title     = {Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {8634-8644}
}
Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
Abstract
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious-feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable more reliable bias assessment.
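The evaluation protocol the abstract describes, comparing a bias score before and after perturbing non-gender features and reporting the relative shift, can be illustrated with a minimal sketch. The metric below (the gap in error rate between gender groups), the function names, and the example numbers are illustrative assumptions for exposition, not the paper's exact metrics or results.

```python
# Illustrative sketch (assumed metric, not the paper's exact protocol):
# measure a bias score on original images and on perturbed images
# (e.g., objects masked or backgrounds blurred), then report the
# relative shift as a feature-sensitivity measurement.

def error_rate_gap(records):
    """Toy bias score: absolute gap in error rate between gender groups.

    `records` is a list of (gender_label, prediction_correct) pairs.
    """
    counts = {}
    for gender, correct in records:
        n, errs = counts.get(gender, (0, 0))
        counts[gender] = (n + 1, errs + (0 if correct else 1))
    rates = {g: errs / n for g, (n, errs) in counts.items()}
    return abs(rates["female"] - rates["male"])

def relative_shift(original_score, perturbed_score):
    """Percent change in the bias score caused by the perturbation."""
    return 100.0 * abs(perturbed_score - original_score) / original_score

# Hypothetical model outputs on the same images before and after perturbation.
baseline = [("female", False)] + [("female", True)] * 3 + [("male", True)] * 4
perturbed = [("female", False)] * 3 + [("female", True)] + \
            [("male", False)] + [("male", True)] * 3

orig_score = error_rate_gap(baseline)    # 0.25
pert_score = error_rate_gap(perturbed)   # 0.50
shift = relative_shift(orig_score, pert_score)  # 100.0 (% shift)
```

A large `relative_shift` under a weak perturbation signals that the bias score is driven partly by spurious features, which is the kind of sensitivity report the authors recommend publishing alongside the bias metric itself.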
