Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups

Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2778-2785

Abstract


We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels and finetuning to represent concepts; zero-shot models such as CLIP instead perform tasks with an open vocabulary, using text embeddings to represent concepts rather than a fixed set of labels. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection, and semantic segmentation? We evaluate different vision-language models with multiple datasets across a set of concepts and find that (i) all models evaluated show distinct performance differences when identifying concepts based on the gender of the person co-occurring in the image; (ii) model calibration (i.e., the relationship between accuracy and confidence) also differs distinctly by gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can contribute to propagating social biases in zero-shot settings.
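To make the zero-shot setup concrete, the sketch below shows open-vocabulary image classification with CLIP via the Hugging Face transformers API, together with a simple per-group accuracy and mean-confidence comparison in the spirit of the paper's analysis. This is a minimal illustration, not the authors' evaluation code: the model name, concept prompts, and perceived-gender annotations in `samples` are assumptions for the example.

```python
# Minimal sketch: zero-shot classification with CLIP text embeddings,
# plus a per-group accuracy / mean-confidence comparison.
from collections import defaultdict

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open-vocabulary concepts expressed as text prompts (illustrative).
concepts = ["a photo of a skateboard", "a photo of a laptop", "a photo of a dog"]


def classify(image: Image.Image) -> tuple[int, float]:
    """Return (predicted concept index, confidence) for one image."""
    inputs = processor(text=concepts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_concepts)
    probs = logits.softmax(dim=-1)[0]
    conf, pred = probs.max(dim=-1)
    return pred.item(), conf.item()


# `samples` would hold (image, true concept index, perceived-gender label)
# triples from an annotated evaluation set; left empty here.
samples: list[tuple[Image.Image, int, str]] = []

stats = defaultdict(lambda: {"correct": 0, "conf": 0.0, "n": 0})
for image, label, gender in samples:
    pred, conf = classify(image)
    stats[gender]["correct"] += int(pred == label)
    stats[gender]["conf"] += conf
    stats[gender]["n"] += 1

for gender, s in stats.items():
    if s["n"]:
        print(f"{gender}: accuracy={s['correct'] / s['n']:.3f}, "
              f"mean confidence={s['conf'] / s['n']:.3f}")
```

Comparing accuracy and mean confidence across the groups gives a rough view of the performance and calibration gaps the paper reports.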

Related Material


[pdf]
[bibtex]
@InProceedings{Hall_2023_ICCV,
    author    = {Hall, Melissa and Gustafson, Laura and Adcock, Aaron and Misra, Ishan and Ross, Candace},
    title     = {Vision-Language Models Performing Zero-Shot Tasks Exhibit Disparities Between Gender Groups},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {2778-2785}
}