Scaling laws in zero-shot gender classification using CLIP

Lucas M. Ceschini, Gabriel O. Ramos, Claudio R. Jung; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 21-29

Abstract


Gender classification is a key computer vision task in biometrics, surveillance, targeted advertising, and demographic studies. Multimodal vision-language models, such as OpenAI's CLIP, have shown promising zero-shot capabilities on several downstream tasks without the need for fine-tuning or retraining, allowing end users to simply input an image and ask about its characteristics. This could provide an efficient solution for small projects and applications that require a gender classification feature but lack the know-how or resources to run state-of-the-art models. However, given the nature of the data used to train these models, and the impact of the textual prompt on classification results, such models could cause harm by amplifying known stereotypes and prejudice. In this paper, we investigate the impact of data and model scaling on zero-shot gender classification using openCLIP, an open-source implementation of the Contrastive Language-Image Pre-training model. We observed a minor improvement in accuracy and fairness when scaling model parameters, but no improvement when scaling training data, suggesting that raw data alone is not enough. We also explored how textual prompt ensembling and aggregation techniques can improve fairness by reducing the gap between per-gender accuracies.
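The zero-shot classification and prompt-ensembling recipe described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `encode_text` stub stands in for a real text encoder (in practice, something like open_clip's `model.encode_text` on tokenized prompts), and the prompt templates and class labels are hypothetical examples. The ensembling step (L2-normalize each prompt embedding, average, renormalize) follows the standard CLIP prompt-ensembling recipe.

```python
import hashlib
import numpy as np

EMB_DIM = 512  # assumed embedding dimensionality

def encode_text(prompt: str) -> np.ndarray:
    # Deterministic pseudo-embedding keyed on the prompt string.
    # A stand-in for a real CLIP text encoder, used only so this
    # sketch is self-contained and runnable.
    seed = int(hashlib.md5(prompt.encode()).hexdigest(), 16) % 2**32
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def ensemble_class_embedding(templates, label):
    # Embed every template filled with the class label, L2-normalize
    # each embedding, then average and renormalize the mean.
    embs = np.stack([encode_text(t.format(label)) for t in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Hypothetical prompt templates and class labels.
templates = ["a photo of a {}", "a cropped photo of a {}", "a portrait of a {}"]
classes = ["man", "woman"]
class_embs = np.stack([ensemble_class_embedding(templates, c) for c in classes])

def classify(image_emb: np.ndarray) -> str:
    # Zero-shot prediction: cosine similarity between the (normalized)
    # image embedding and each ensembled class embedding, take argmax.
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = class_embs @ image_emb
    return classes[int(np.argmax(sims))]
```

In a real pipeline, `image_emb` would come from the vision tower (e.g. `model.encode_image` on a preprocessed image); here any 512-d vector can be passed to `classify` to exercise the aggregation logic.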

Related Material


[bibtex]
@InProceedings{Ceschini_2025_CVPR,
  author    = {Ceschini, Lucas M. and Ramos, Gabriel O. and Jung, Claudio R.},
  title     = {Scaling laws in zero-shot gender classification using CLIP},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {21-29}
}