Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01pz50gz72c
Title: Extracting Empirical Statistics on Gender Bias From Word Embeddings: A Multilingual Analysis
Authors: Abdulhusein, Neamah
Advisors: Narayanan, Arvind
Department: Computer Science
Certificate Program: Center for Statistics and Machine Learning
Class Year: 2017
Abstract: Recent analyses of gender and other categories of bias in word embeddings have focused on developing algorithmic methods to remove these unwanted biases from vector space models. However, little work has been done to connect the biases present in word embeddings with what they can tell us about the environment in which a language is spoken. In this work, we ask whether we can extract empirical information about the world around us from the vector space models of the languages we speak, specifically with regard to gender bias. We conduct an intra-lingual experiment to determine whether the gender associations of sports words in a language model L can predict the percentage of female participants in those sports at the Summer Olympics from countries that speak L. We conduct two inter-lingual experiments to determine whether the gender score of a language can be used to predict country-specific gender statistics, such as the UN Gender Inequality Index (GII). Our intra-lingual experiments in English, Spanish, and Portuguese show highly significant results. Our inter-lingual experiment on predicting the UN GII also shows significant potential for using the proposed metrics to predict country- and language-specific empirical statistics.
URI: http://arks.princeton.edu/ark:/88435/dsp01pz50gz72c
Type of Material: Princeton University Senior Theses
Language: en_US
Appears in Collections: Computer Science, 1987-2023

Files in This Item:
File: neamah_written_final_report_bound.pdf
Size: 995.32 kB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.