Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings

https://doi.org/10.1016/j.future.2020.08.032

Highlights

  • Social computing and sentiment analysis technologies can be applied to misogyny detection.

  • Performance of misogyny identification improves when combining linguistic features and word-embeddings.

  • Release of a balanced corpus in Spanish regarding misogyny.

  • Differences among dialects and cultural backgrounds in Spanish hinder misogyny identification.

  • Offensive language, grammatical gender, and grammatical errors and misspellings are discerning linguistic features.

Abstract

Online social networks allow powerless people to gain enormous amounts of control over particular people’s lives and profit from the anonymity or social distance that the Internet provides in order to harass other people. One of the most frequently targeted groups comprises women, as misogyny is, unfortunately, a reality in our society. However, although great efforts have recently been made to identify misogyny, it is still difficult to distinguish as it can sometimes be very subtle and deep, signifying that the use of statistical approaches is not sufficient. Moreover, as Spanish is spoken worldwide, context and cultural differences can complicate this identification. Our contribution to the detection of misogyny in Spanish is two-fold. On the one hand, we apply Sentiment Analysis and Social Computing technologies to detect misogynous messages on Twitter. On the other, we have compiled the Spanish MisoCorpus-2020, a balanced corpus regarding misogyny in Spanish, and classified it into three subsets concerning (1) violence towards relevant women, (2) messages harassing women in Spanish from Spain and Spanish from Latin America, and (3) general traits related to misogyny. Our proposal combines a classification based on average word embeddings and linguistic features in order to understand which linguistic phenomena principally contribute to the identification of misogyny. We have evaluated our proposal with three machine-learning classifiers, achieving a best accuracy of 85.175%. Finally, the proposed approach is also validated with existing corpora for misogyny and aggressiveness detection, such as AMI and HatEval, obtaining good results.

Introduction

A survey carried out by the Pew Research Center showed that roughly four-in-ten Americans have experienced online harassment [1] in social media. Certain groups are more likely to experience some sort of trait-based harassment than others, the most important being those motivated by ethnicity or gender. This survey revealed that women were about twice as likely as men to state that they had been targeted as a result of their gender. Women, and specifically young women, usually undergo sexualized forms of harassment. For example, 21% of women between 18 and 29 years of age report being sexually abused online, more than double the share of men; and more than half of them claim to have received explicit images that they had not requested.

According to Kate Manne in “The Logic of Misogyny” [2], misogyny refers to social environments in which “women will tend to face hostility of various kinds because they are women in a man’s world who are held to be failing to live up to men’s standards”. Manne points out a relevant difference between the concepts of sexism and misogyny: sexism refers exclusively to discriminatory behaviour between men and women, whereas misogyny creates the artificial categories of good and bad women with the aim of punishing what misogynists consider to be bad women. In Woman-Hating: On Misogyny, Sexism, and Hate Speech, Louise Richardson-Self identifies misogynous behaviour with signs of hostility that have a coercive function [3]. In this respect, it is possible to distinguish among hate speech discourses that oppress by including certain signs of harassment or intimidation towards stigmatized group members. For example, much of the damage done by the spread of gossip and slander on the Internet causes harm to specific groups. Women specifically suffer from a significant portion of this harmful material that feminists call objectification, which refers to being considered objects for men’s use and abuse. Moreover, it is possible to find misogyny hidden in messages with the aim of discrediting someone because of their gender, attempting to lessen women’s voices as regards an important issue, expecting women to conform to certain unfounded stereotypes, or even giving women needless explanations, usually in a condescending manner (aka mansplaining), which typically also involves the chronic interruption of women when they are speaking.

We have, in recent years, witnessed the ever-growing use of social media, in which freedom of expression is subject to basic rules of behaviour in order to guarantee a healthy environment and prevent online harassment. However, the Internet has allowed powerless people to gain enormous power over the lives of particular people by exploiting specific properties of the Internet such as anonymity [4]. Women who have attained success consequently suffer from sexist attacks, giving rise to the term Violence Against Women in Politics (VAWIP) [5], [6]. Online misogyny has, therefore, been compared to witch-hunting, since both have a similar function: coercing women in order to prevent them from progressing [7].

Social Computing and Sentiment Analysis technologies have great potential to extract critical insights from opinions shared over social networks, which can help to identify hate speech and discrimination [8]. These technologies have been applied in multiple text classification tasks such as irony [9] or hate speech detection [10]. If we consider misogyny to be a form of hate speech, then hate speech detectors should perform perfectly well when analysing text containing misogynous traits; but in some cases, misogyny is very subtle and deep and can, therefore, be difficult to distinguish. Moreover, context and cultural differences can complicate this identification to an even greater extent. Despite these difficulties, automatic misogyny identification is gaining relevance as a challenging task, and has consequently been proposed in conferences and workshops related to Natural Language Processing (NLP) owing to its social relevance. However, to the best of our knowledge, there are only two datasets available in Spanish for misogyny identification, signifying that more resources in Spanish are still required for a better understanding of misogynous behaviour.

The main contributions of this work can be summarized as:

  • Spanish MisoCorpus-2020. The compilation and classification of a balanced corpus of tweets related to misogyny written in Spanish. This corpus is subdivided into three main subsets: (1) Violence Against Relevant Women (VARW), which concerns misogynistic messages directed towards specific women who have gained social relevance, (2) European Spanish vs that of Latin America (SELA), which concerns misogynistic messages written in Spanish from Europe and Spanish from Latin America, with the aim of understanding how misogynous content is affected by cultural background, and (3) Discredit, Dominance, Sexual harassment and Stereotype (DDSS), which concerns tweets containing general aspects related to misogyny, such as derailing, rape, or gender violence, among others. The combination of these subsets is the Spanish MisoCorpus-2020, which has been released for use by the scientific community.

  • The development and evaluation of a machine-learning model with which to detect misogyny. This model is capable of identifying misogynous tweets written in Spanish based on linguistic features and word embeddings. During the evaluation of the model, we analysed each set of features on the whole corpus and on each of its subsets. The results were compared by applying three machine-learning classifiers: (1) Random Forest (RF), an ensemble of decision trees; (2) Sequential Minimal Optimization (SMO), a Support Vector Machines (SVM) classifier; and (3) Linear Support Vector Machines (LSVM).
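The idea of combining averaged word embeddings with linguistic features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the three-dimensional toy embeddings, the one-word lexicon, the two hand-picked features, and the helper names (`tokenize`, `average_word_embedding`, `linguistic_features`, `build_features`) are all hypothetical. A real system would presumably load pretrained Spanish embeddings and a much richer feature set.

```python
from typing import Dict, List

# Hypothetical toy resources: a real system would load pretrained Spanish
# word embeddings and a curated lexicon instead of these hand-made values.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "eres": [0.1, 0.3, -0.2],
    "una": [0.0, 0.1, 0.1],
    "inútil": [-0.6, 0.2, 0.4],
}
TOY_OFFENSIVE_LEXICON = {"inútil"}


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    stripped = (tok.strip(".,;:!?¡¿\"'") for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def average_word_embedding(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens; zeros if no token is known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def linguistic_features(text: str, tokens: List[str]) -> List[float]:
    """Two illustrative features: offensive-word ratio and '!' count."""
    n_tokens = max(len(tokens), 1)
    offensive_ratio = sum(tok in TOY_OFFENSIVE_LEXICON for tok in tokens) / n_tokens
    exclamations = float(text.count("!"))
    return [offensive_ratio, exclamations]


def build_features(text: str) -> List[float]:
    """Concatenate the averaged embedding (AWE) with the linguistic features (LF)."""
    tokens = tokenize(text)
    awe = average_word_embedding(tokens, TOY_EMBEDDINGS, dim=3)
    lf = linguistic_features(text, tokens)
    return awe + lf
```

Calling `build_features("Eres una inútil!")` yields a five-dimensional vector: three averaged embedding dimensions followed by the two linguistic features. A vector of this kind could then be passed to any of the classifiers listed above.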

The remainder of the paper is organized as follows: Section 2 describes recent works and approaches concerning the detection of misogyny. Section 3 describes the way in which the corpus was compiled and its composition. Section 4 describes the linguistic features employed in our model along with the results obtained from the experiments performed. Section 5 contains a detailed analysis of the results and, finally, Section 6 summarizes the conclusions of the paper and presents future research directions.

Related work

In recent years, emerging technologies in Social Computing such as Sentiment Analysis (SA) have provided the capacity to understand citizens’ attitudes and feelings, which has benefited domains such as marketing and customer support. For example, in [11] the authors combined Google Trends and Twitter, enabling decision makers to monitor social networks in real time, as the literature has demonstrated the relationship between Google Trends and real-world phenomena. This new capacity of analysis in

Corpus

This study has been carried out using a dataset concerning misogynous behaviours. We first conducted a review of the literature to find existing corpora related to misogyny in Spanish. There are, however, few publicly available resources regarding automatic misogyny identification, as is reported in the survey carried out by Shushkevich and Cardiff [19]. This survey describes, among other things, the Spanish AMI 2018 dataset from IberEval, described in [13] (see Section 2). This corpus was

Evaluation

In this section we detail which features were extracted in order to train the models (see Section 4.1), and the evaluation of each model with the whole corpus and each subset (see Section 4.2).
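The paper reports accuracy together with a standard deviation, which suggests a repeated evaluation such as k-fold cross-validation. The excerpt does not confirm the exact protocol, so the following is only a sketch under that assumption. The fold assignment by index modulo `k`, the `majority_baseline` stand-in classifier, and the use of the sample standard deviation are all illustrative choices, not the authors' setup.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (train_texts, train_labels, test_texts) -> predictions
Predictor = Callable[[Sequence[str], Sequence[int], Sequence[str]], List[int]]


def k_fold_accuracies(
    texts: Sequence[str], labels: Sequence[int], k: int, train_and_predict: Predictor
) -> List[float]:
    """Return the accuracy (in %) of each fold; item i goes to fold i % k."""
    n_items = len(texts)
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(n_items) if i % k == fold]
        train_idx = [i for i in range(n_items) if i % k != fold]
        predictions = train_and_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in test_idx],
        )
        correct = sum(pred == labels[i] for pred, i in zip(predictions, test_idx))
        accuracies.append(100.0 * correct / len(test_idx))
    return accuracies


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean accuracy and sample standard deviation across folds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def majority_baseline(
    train_texts: Sequence[str], train_labels: Sequence[int], test_texts: Sequence[str]
) -> List[int]:
    """Stand-in classifier: always predict the most frequent training label."""
    label_list = list(train_labels)
    majority = max(set(label_list), key=label_list.count)
    return [majority] * len(test_texts)
```

The same loop could be run once on the whole corpus and once per subset, swapping `majority_baseline` for a real classifier, so that each feature set is summarized by a mean accuracy and its spread across folds.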

Discussion

The experiments performed confirmed that the combination of Linguistic Features (LF) and Average of Word-Embeddings (AWE) improved the detection of misogyny and oppressive speech. The combined model outperformed the baseline model based on BoW, as well as AWE and LF used separately, attaining an accuracy of 85.175% with a standard deviation of 1.450. This result suggests that these sets of features are complementary and some of the linguistic phenomena related to oppressive speech cannot be

Conclusions and further work

In this paper, we present and evaluate a model for misogyny identification for the Spanish language based on Average Word Embeddings and Linguistic Features (AWE+LF) using a new corpus compiled with tweets in Spanish. The model obtains an overall accuracy of 85.175% with SMO, which outperforms a baseline model based on BoW, as well as the models based on linguistic features (LF) and the Average of Word Embeddings (AWE) separately. We additionally used our model to evaluate the three subsets of the

CRediT authorship contribution statement

José Antonio García-Díaz: Software, Validation, Investigation, Resources, Data curation, Writing - original draft, Visualization. Mar Cánovas-García: Conceptualization, Validation, Investigation, Resources, Writing - review & editing. Ricardo Colomo-Palacios: Methodology, Validation, Formal analysis, Resources, Writing - review & editing. Rafael Valencia-García: Conceptualization, Validation, Supervision, Resources, Project administration, Funding acquisition, Writing - review & editing,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been supported by the Spanish National Research Agency (AEI) and the European Regional Development Fund (FEDER/ERDF) through projects KBS4FIA (TIN2016-76323-R) and LaTe4PSP (PID2019-107652RB-I00). In addition, José Antonio García-Díaz has been supported by Banco Santander and University of Murcia, Spain through the Doctorado industrial programme.

José Antonio García-Díaz received the B.Sc. and M.Sc. degrees in computer science from the University of Murcia, Espinardo, Spain. He is currently pursuing the Ph.D. degree in computer science with the University of Murcia where he is a member of the TECNOMOD (Knowledge Modelling, Processing and Management Technologies) Research Group. His research interests include Natural Language Processing and infodemiology.

References (58)

  • Siapera E. Online misogyny as witch hunt: Primitive accumulation in the age of techno-capitalism

  • Lytras M. et al. Social media mining for smart cities and smart villages research. Soft Comput. (2020)

  • Corazza M. et al. A multilingual evaluation for online hate speech detection. ACM Trans. Internet Technol. (TOIT) (2020)

  • D’Avanzo E. et al. Using Twitter sentiment and emotions analysis of Google Trends for decisions making. Program (2017)

  • Lytras M.D. et al. Big data and their social impact: Preliminary study. Sustainability (2019)

  • Fersini E. et al. Overview of the task on automatic misogyny identification at IberEval 2018

  • Fersini E. et al. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI)

  • Frenda S. et al. Online hate speech against women: Automatic identification of misogyny and sexism on Twitter. J. Intell. Fuzzy Systems (2019)

  • Z. Waseem, D. Hovy, Hateful symbols or hateful people? predictive features for hate speech detection on twitter, in:...

  • Lynn T. et al. A comparison of machine learning approaches for detecting misogynistic speech in Urban Dictionary

  • T. Banerjee, A.H. Yazdavar, A. Hampton, H. Purohit, V.L. Shalin, A.P. Sheth, Identifying pragmatic functions in social...

  • Cardiff J. et al. Automatic misogyny detection in social media: a survey. J. Comput. Sci. Eng. (2019)

  • V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F.M.R. Pardo, P. Rosso, M. Sanguinetti, Semeval-2019 task 5:...

  • Wais K. Gender prediction methods based on first names with genderizeR. R J. (2016)

  • Armstrong B.C. et al. Relative meaning frequencies for 578 homonyms in two Spanish dialects: A cross-linguistic extension of the English eDom norms. Behav. Res. Methods (2016)

  • Berns N.S. Framing the Victim: Domestic Violence, Media, and Social Problems (2017)

  • Anastasio P.A. et al. Twice hurt: How newspaper coverage may reduce empathy and engender blame for female victims of crime. Sex Roles (2004)

  • Mintz M. et al. Distant supervision for relation extraction without labeled data

  • Krippendorff K. Reliability in content analysis: Some common misconceptions and recommendations. Hum. Commun. Res. (2004)
    Mar Cánovas-García received the B.Sc. degree in computer science from the University of Murcia, Espinardo, Spain. She is currently pursuing the M.Sc. entitled New Technologies in Computer Science at the University of Murcia, specializing in intelligent and knowledge technologies with applications in medicine. Her research interests include Natural Language Processing and Big Data.

    Ricardo Colomo-Palacios is a Full Professor at the Computer Science Department of the Ostfold University College, Norway. Formerly he worked at Universidad Carlos III de Madrid, Spain. His research interests include applied research in information systems, software project management, people in software projects, business software, software and services process improvement and web science. He received his Ph.D. in Computer Science from the Universidad Politécnica of Madrid (2005). He also holds an MBA from the Instituto de Empresa (2002). He has worked as a Software Engineer, Project Manager and Software Engineering Consultant in several companies, including Spanish IT leader INDRA. Prof. Dr. Colomo-Palacios is also an Editorial Board Member and Associate Editor for several international journals.

    Rafael Valencia-García received the B.E., M.Sc., and Ph.D. degrees in Computer Science from the University of Murcia, Espinardo, Spain. He is currently a Full Professor with the Department of Informatics and Systems, University of Murcia. His main research interests are natural language processing, Semantic Web and recommender systems. He has participated in more than 35 research projects. He has published over 150 articles in journals, conferences, and book chapters, 50 of them in JCR-indexed journals. He is the author or coauthor of several books. He has been guest editor of five JCR-indexed journals (CSI, IJSEKE, JRPIT, JUCS, SCP).