Abstract

Multi-modal vision-language models have shown promising zero-shot capabilities across a myriad of downstream tasks, without the need for any fine-tuning or retraining. Several studies have investigated how these models were trained and on what data, and how they can be improved and generalized through prompt engineering. Here we explored the impact of data and model scaling on zero-shot gender classification using openCLIP, and how to employ simple prompt tuning methods to improve these results. We observed a 2% improvement in accuracy when scaling up the models, yet we did not see the same behavior when scaling the data sources. For all models, OpenAI's original WIT dataset achieved the highest accuracy, even though it is 30 times smaller than the largest CommonPool dataset, composed of 12.3 billion samples. We then explored textual prompt ensembling and similarity aggregation techniques that improved on the baseline results while retaining overall accuracy and reducing contextual bias.
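
As a rough illustration of the zero-shot pipeline and prompt ensembling described above, the sketch below uses the open_clip library to average text embeddings over several prompt templates per class and classify an image by cosine similarity. The model name, pretrained tag, prompt templates, class labels, and image path are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
import torch
import open_clip
from PIL import Image

# Load an openCLIP model and its tokenizer (names are illustrative choices).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical prompt templates for ensembling; the paper's prompts may differ.
templates = ["a photo of a {}", "a portrait of a {}", "a cropped photo of a {}"]
classes = ["man", "woman"]

with torch.no_grad():
    # Build one ensembled text embedding per class by averaging the
    # normalized embeddings of all prompt templates for that class.
    class_embeddings = []
    for cls in classes:
        tokens = tokenizer([t.format(cls) for t in templates])
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        class_embeddings.append(mean_emb / mean_emb.norm())
    text_features = torch.stack(class_embeddings)

    # Classify an image by its cosine similarity to each ensembled class embedding.
    image = preprocess(Image.open("face.jpg")).unsqueeze(0)  # placeholder image path
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(dict(zip(classes, probs.squeeze(0).tolist())))
```

Averaging template embeddings before the similarity step is one common way to ensemble prompts; aggregating the per-template similarity scores instead is an alternative that the abstract's "similarity aggregation" likely refers to.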