Machine-learning model fed web content makes racist and sexist associations

Human biases exposed by Implicit Association Tests can be replicated in machine learning using the GloVe word embedding, according to a new study in which GloVe was trained on "a corpus of text from the Web."

Because GloVe has no experience with material things, the associations are purely statistical. For instance, GloVe has never experienced flowers or insects, but flowers are more associated with pleasant terms, and insects are more associated with unpleasant ones. Same with musical instruments vs. weapons.

GloVe also reproduced established racial and gender-based biases in language. "In our results, European American names are more likely than African American names to be closer to pleasant than to unpleasant, with an effect size of 1.41 and p-value < 10^-8." Below is the dataset (edited to remove untested words):

European American names: Adam, Harry, Josh, Roger, Alan, Frank, Justin, Ryan, Andrew, Jack, Matthew, Stephen, Brad, Greg, Paul, Jonathan, Peter, Amanda, Courtney, Heather, Melanie, Katie, Betsy, Kristin, Nancy, Stephanie, Ellen, Lauren, Colleen, Emily, Megan, Rachel.

African American names: Alonzo, Jamel, Theo, Alphonse, Jerome, Leroy, Torrance, Darnell, Lamar, Lionel, Tyree, Deion, Lamont, Malik, Terrence, Tyrone, Lavon, Marcellus, Wardell, Nichelle, Shereen, Ebony, Latisha, Shaniqua, Jasmine, Tanisha, Tia, Lakisha, Latoya, Yolanda, Malika, Yvette.

Pleasant terms: caress, freedom, health, love, peace, cheer, friend, heaven, loyal, pleasure, diamond, gentle, honest, lucky, rainbow, diploma, gift, honor, miracle, sunrise, family, happy, laughter, paradise, vacation.

Unpleasant terms: abuse, crash, filth, murder, sickness, accident, death, grief, poison, stink, assault, disaster, hatred, pollute, tragedy, bomb, divorce, jail, poverty, ugly, cancer, evil, kill, rotten, vomit.
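
For readers wondering what an "effect size" means here, the following is a minimal sketch of how such a word-embedding association score is commonly computed: each target word gets a score equal to its mean cosine similarity to the pleasant terms minus its mean cosine similarity to the unpleasant terms, and the effect size is the difference between the two groups' mean scores divided by the standard deviation of all the scores. The tiny two-dimensional vectors and the placeholder word names below are invented for illustration and are not GloVe data; the use of the sample standard deviation is also an assumption, not something the study's quoted text confirms.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attr_a, attr_b, vec):
    """Mean similarity of `word` to attribute set A minus its mean similarity to set B."""
    sim_a = mean(cosine(vec[word], vec[a]) for a in attr_a)
    sim_b = mean(cosine(vec[word], vec[b]) for b in attr_b)
    return sim_a - sim_b


def effect_size(targets_x, targets_y, attr_a, attr_b, vec):
    """Difference of the two target groups' mean association scores,
    divided by the standard deviation of the scores over both groups.
    Assumption: sample standard deviation (statistics.stdev)."""
    scores_x = [association(x, attr_a, attr_b, vec) for x in targets_x]
    scores_y = [association(y, attr_a, attr_b, vec) for y in targets_y]
    return (mean(scores_x) - mean(scores_y)) / stdev(scores_x + scores_y)


# Hypothetical toy vectors, invented for this sketch (not real GloVe embeddings).
toy_vectors = {
    "flower_1": [0.9, 0.1], "flower_2": [0.8, 0.3],
    "insect_1": [0.1, 0.9], "insect_2": [0.2, 0.8],
    "pleasant_1": [1.0, 0.0], "pleasant_2": [0.9, 0.2],
    "unpleasant_1": [0.0, 1.0], "unpleasant_2": [0.1, 0.9],
}

flowers = ["flower_1", "flower_2"]
insects = ["insect_1", "insect_2"]
pleasant = ["pleasant_1", "pleasant_2"]
unpleasant = ["unpleasant_1", "unpleasant_2"]

d = effect_size(flowers, insects, pleasant, unpleasant, toy_vectors)
print(f"toy effect size: {d:.2f}")
```

A positive value means the first target group sits closer to the pleasant terms than the second group does. Running the same function on real embeddings would require loading pretrained vectors into a dictionary like `toy_vectors`, which this sketch leaves out.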

They also found gender biases, with female names more closely associated with family terms and male names more closely associated with career terms.

• Semantics derived automatically from language corpora necessarily contain human biases (randomwalker.info)

Image: Sólveig Zophoníasdóttir