Related Work¶
Related Work in Gender Bias¶
Following the initial experiments performed by the CLIP authors in Section 7.1 (Bias) of their paper, several works have focused on exploring the social biases present in vision-language models. These initial experiments detected racial and gender biases in image classification tasks when prompting CLIP with human and non-human labels, as well as gender-agnostic profession labels.
In \cite{gender_disparities} the authors explore gender bias in vision-language models by probing CLIP and two models built upon it: Detic \cite{detic} for bounding-box detection and LSeg \cite{lseg} for semantic segmentation. To compare the performance of the models in image classification, object detection, and segmentation, they used the Visual Genome dataset \cite{visual_genome}, where each image contains bounding boxes for persons and gender-neutral objects such as bag, necklace, and sweater. All models performed better on images containing women for these three objects, while performing better on images containing men for objects such as necktie, wheel, and hat. They also point out that the common practice of reporting the mean average precision across all concepts as a single summary statistic of model performance can mask the disparities between them and lend a false assurance that the model is consistent, and therefore unbiased, across demographic groups. To explore the impact of the dataset on the studied tasks, they employed both the Visual Genome and the COCO \cite{coco} datasets. For a range of concepts they showed that the bias is present in both, and is therefore not dataset specific. Furthermore, they found that the biases in word embeddings parallel the biases in zero-shot vision-language models, which supports the claim that biases from language models are passed on to the vision models learned from them.
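To illustrate the masking effect described above, the short sketch below contrasts a single mAP summary with per-group average precision values. The numbers are hypothetical and serve only to show how a gap between demographic groups can disappear in the aggregate.

```python
# Hypothetical AP values per (concept, group); not the paper's actual results.
ap_by_group = {
    ("handbag", "women"): 0.71, ("handbag", "men"): 0.49,
    ("necktie", "women"): 0.42, ("necktie", "men"): 0.68,
}

# A single summary statistic looks balanced...
mAP = sum(ap_by_group.values()) / len(ap_by_group)
print(f"mAP = {mAP:.2f}")

# ...while per-group reporting reveals the disparity.
for (concept, group), ap in sorted(ap_by_group.items()):
    print(f"AP[{concept}, {group}] = {ap:.2f}")
```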
According to \cite{fairer_clip}, model bias can be classified based on the dependencies between data attributes, which can be dependent on or independent of each other. A correlation between dependent attributes, such as high cheekbones and sex, is called an intrinsic dependence, while a correlation between independent attributes, such as hair color and sex, is called a spurious correlation. The authors note that existing debiasing methods commonly focus on spurious correlations, may or may not require ground-truth labels, and often rely on computationally costly iterative procedures; their method is designed to address all of these issues. They learn the biases present in CLIP's text and image embeddings and deploy their model as a post-processing stage after CLIP encoding and prior to the cosine-similarity classifier.
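As a rough illustration of where such a post-processing stage sits in the pipeline (not FairerCLIP's actual algorithm), the sketch below projects a hypothetical bias direction out of frozen CLIP embeddings before the cosine-similarity classifier is applied.

```python
import numpy as np

def project_out(embeddings: np.ndarray, bias_direction: np.ndarray) -> np.ndarray:
    """Remove a (hypothetical) bias direction from L2-normalized embeddings."""
    v = bias_direction / np.linalg.norm(bias_direction)
    debiased = embeddings - (embeddings @ v)[:, None] * v[None, :]
    return debiased / np.linalg.norm(debiased, axis=1, keepdims=True)

# Usage sketch: `image_embs`/`text_embs` would come from a frozen CLIP encoder
# and `v_bias` from a bias-estimation step; the classifier itself is unchanged:
# logits = project_out(image_embs, v_bias) @ project_out(text_embs, v_bias).T
```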
The authors of \cite{biased_prompts} propose a calibration loss that minimizes the discrepancy between a pair of prompt embeddings that contain biases. By doing so, the embeddings of the male and female versions of a given prompt become similar and are thus debiased. Focusing only on the text embeddings was sufficient to improve the group robustness of zero-shot models, corroborating the claimed parallelism between VLM and LLM biases.
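A minimal reading of this idea is sketched below as a penalty on the distance between the embeddings of gendered versions of the same prompt; this is a simplified stand-in, not the exact formulation of \cite{biased_prompts}.

```python
import torch
import torch.nn.functional as F

def calibration_loss(prompt_emb_a: torch.Tensor, prompt_emb_b: torch.Tensor) -> torch.Tensor:
    """Penalize the discrepancy between paired prompt embeddings
    (e.g., 'a photo of a male doctor' vs. 'a photo of a female doctor')."""
    a = F.normalize(prompt_emb_a, dim=-1)
    b = F.normalize(prompt_emb_b, dim=-1)
    return (a - b).pow(2).sum(dim=-1).mean()

# Example with random stand-ins for the CLIP text embeddings of paired prompts.
male_embs, female_embs = torch.randn(8, 512), torch.randn(8, 512)
print(calibration_loss(male_embs, female_embs))
```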
Leveraging these findings, the work of \citeonline{dark_side} explored the effect of dataset scaling on racial classification. The authors used openCLIP models and the Chicago Face Dataset (CFD) to probe the models in experiments similar to those performed by OpenAI, classifying images into human (e.g., criminal, thief) and non-human (e.g., gorilla) classes to uncover social biases.
They observed that dataset scaling increased the number of Latino and Black faces labeled with criminal classes on the larger models, while on smaller models this behavior decreased as training set size increased. As for model scaling, on the smaller datasets increasing model size also increased the rate at which all genders and races were classified as criminal, except for images of white women; on the larger datasets, however, increasing model size increased the rate at which all classes of men were labeled as criminals. Furthermore, they corroborated that the social biases present in multi-modal settings parallel those present in textual embeddings.
The ease of use and accessibility of such models could improve the work of those who lack the technical knowledge or the resources to build, deploy, and use vision models in their activities. However, the unaware use of these tools with poor class design could be dangerous, especially for oppressed demographic groups.
From these works we can better understand the presence and impact of biased points of view regarding gender in our society, captured by these large models trained without constraints on data scraped from every corner of the web. While there is no definitive solution, it is of the utmost importance to mitigate this kind of bias in real-world applications that rely on the judgment of these models for their decision making.
Based on these findings, our work aims to further explore the impact of model and data scaling of CLIP models on the zero-shot gender classification task, while also observing the power of prompt tuning and prompt ensembling in improving classification and fairness (TODO) measures.
Related Work on Zero-shot Classification¶
Large vision-language models perform extremely well on zero-shot downstream tasks \cite{clip}, i.e., when they are deployed without further training or fine-tuning. Hence, they can be used out of the box, avoiding the massive amounts of processing power required to prepare the data, train the model, and fine-tune it. For zero-shot image classification, the only step required is proper class design, which must reflect the target classification labels and compose the text prompts. For example, a two-class problem involving cats and dogs can be cast as encoding an image of such a pet and comparing it with the encodings of a set of text prompts, such as ``a photo of a dog'' and ``a photo of a cat''. The contrastive-learning nature of these models allows us to compute the cosine similarity between the image and text-prompt embeddings and take the most similar prompt as the final classification label.
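The cat/dog example above can be written in a few lines. The sketch below uses the original CLIP package and a placeholder image path, and simply picks the prompt whose embedding is most similar to the image embedding.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class design: one prompt per target label.
prompts = ["a photo of a dog", "a photo of a cat"]
image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)  # placeholder path
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)

# Cosine similarity between the image and each prompt embedding.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(prompts[similarity.argmax().item()])
```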
To further explore the capabilities of CLIP in zero-shot tasks, the authors of openCLIP \cite{open_clip} trained several ViT models with different patch sizes and numbers of trainable parameters on a variety of open-source datasets, such as LAION-400M \cite{laion400}, LAION-5B \cite{laion}, DataComp-1B, and its unfiltered version with 12.8B image-text pairs, called CommonPool \cite{datacomp}. They observed the scaling behavior of CLIP models as a function of training set size, model size, and compute. Beyond their findings, they also released more than 120 model checkpoints, trained on different dataset combinations, in their public repository \cite{open_clip_repo}. Gadre et al.~\cite{datacomp} showed that scaling is important, but that the quality of the dataset also plays an important role: their results indicate that training on a ``good'' subset of a massive dataset can yield better results than training on all of the data.
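These checkpoints can be loaded directly through the open_clip library. The sketch below loads one of them (the architecture and pretraining tag shown are just one of the released combinations) and builds the matching tokenizer and preprocessing transform.

```python
import open_clip

# One of the released architecture/pretraining combinations; see the openCLIP
# repository for the full list of available checkpoints.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
```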
In \cite{learn_name_classes} the authors draw attention to the fact that VLMs suffer from high sensitivity to the textual content of class prompts and from the difficulty of adapting to new, unseen data. They propose to leverage the concept of textual inversion \cite{textual_inversion} to adapt models to new data and thus address suboptimal, and potentially wrong, labels. To do so, they describe the categories of interest by using the available data to learn optimal word embeddings as a function of the visual content.
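The sketch below conveys the textual-inversion idea in a self-contained way: a single word embedding is optimized so that prompts built with it align with the category's image embeddings. All tensors are random stand-ins for frozen CLIP outputs, and the text encoder is replaced by a trivial combination step.

```python
import torch
import torch.nn.functional as F

emb_dim = 512
image_embs = F.normalize(torch.randn(16, emb_dim), dim=-1)  # stand-in for frozen CLIP image embeddings
context = F.normalize(torch.randn(1, emb_dim), dim=-1)      # stand-in for the embedded prompt context

learned_word = torch.randn(1, emb_dim, requires_grad=True)  # the class "name" to be learned
optimizer = torch.optim.Adam([learned_word], lr=1e-2)

for _ in range(100):
    # Stand-in for the text encoder: combine the prompt context with the learned word.
    text_emb = F.normalize(context + learned_word, dim=-1)
    loss = 1.0 - F.cosine_similarity(text_emb, image_embs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```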
The work of \cite{tai-dpt} exploits the contrastive-learning nature of the image and text encoders found in VLMs to treat texts as images for prompt tuning, alleviating the need for visual data to learn prompts and reducing the impact of data-limited and label-limited settings.
The authors of \cite{robustification} observe that language models contain actionable insights that can be exploited to improve themselves or other models, and show that it is possible to improve a model's zero-shot performance without labeled data, training or fine-tuning, or manual identification.