A Systematic Bibliometric Analysis of Hate Speech Detection on Social Media Sites

With the increasing availability of internet access across the globe, the internet plays an integral part in modern communication. People can contact others and share thoughts and ideas quickly. This has led to an enormous spread of hate speech on online social media sites. This paper aims to provide a systematic bibliometric analysis and mapping of the existing literature on hate speech detection and to identify the extent of hate speech-related research. The bibliometric analysis considers Machine Learning and Deep Learning articles on hate, hostile, and abusive speech. It is carried out using the Scopus database, with tools such as VOSviewer, Biblioshiny, and ScienceScape. The parameters explored include document type, most active countries, top journals, relevant affiliations, and trending topics. It is observed that the current literature on hate speech is concentrated on a specific philosophy. Given recent occurrences of hate speech in the digital world, this bibliometric analysis makes evident a need to rectify this situation.


INTRODUCTION
Digital landscapes have altered the face of communication globally. With increased use of this digitized culture, the internet and digital media have become a venue for online hate speech, acting as a trigger for digital hate, online hate declarations, and cyberbullying. [1] Often this hatred is targeted at a community or at persons of colour, people of a different ethnicity or race, or a religious group. Hate speech is considered a blanket term for various kinds of offensive, abusive, or insulting user-created content. Uncontrolled sharing and posting of content containing hate speech is observed on digital platforms, which unfortunately can have negative psychological effects on certain individuals. The use of social networking sites as a venue for public interaction is thus a two-edged sword. Concerns about the frequency of hate speech on the internet have recently grown louder. Offline brutality and volatility may be mirrored in cyberspace. Organizations, nongovernmental organizations, and broadcasters are advocating for more conversations, as well as more watchdogs and enforcers, to combat hateful speech. Strategies to tackle this have been introduced through legislation, under which several social network platforms were required to sign the Hate Speech code. This required the firms to remove hate content in less than twenty-four hours, but even after this only 0.3 of hostile offenders were charged. [2] To address the problem, firms seldom rely on the community itself to report such content. The absence of systematic automatic approaches and of data collected on its occurrence makes the overall process complex. This is where researchers and scholars begin their work on hate speech identification. One of the significant hurdles in this task is the identification of hate speech or abusive language in Hindi as a natural language, together with the presence of code-mixing (Hindi-English) on online platforms.
Code-mixed language, the recognition of false positives and false negatives, and trends over time have become challenges for the research community in detecting hate speech. [3] Natural Language Processing techniques and developments in Machine Learning models have brought better insight to this area of research. In one study, researchers were able to identify potential predator activity in cybergrooming and to identify social media accounts responsible for promoting hate speech. [4] Escalating research interest and exploration have resulted in the regulation of hate speech in recent times. Furthermore, an increased demand for research in natural languages other than English has been noticed. [5,6] A notable number of research papers on hate speech have been published, but only a small number of systematic review papers were found during this study. Hence, to the best of our understanding, we endeavoured to give a comprehensive quantitative and qualitative appraisal of the scientific landscape of the publications from 2013 until early 2021. The bibliometric analysis also provides standard indicators evaluating the outcomes of publications, together with an analysis of keywords, co-authors, and citations. The use of visualization techniques allowed a better understanding and description of the research work. The study will allow researchers in the area of hate speech to identify the noteworthy authors, publications, sources, most relevant keywords, the impact of their research work, emerging areas, and collaboration opportunities for future research. Statistical evaluation of articles, research areas, and publications in this bibliometric analysis would provide thorough insight for the scientific community. Section 'Related Work' reviews the broad categories into which we classified the previous literature. Section 'Need for bibliometric analysis' discusses the importance of data analysis for future work.
Section 'Preliminary data collection' presents how the data was procured for statistical analysis. Sections 'Bibliometric analysis of first search string' and 'Bibliometric analysis of second search string' analyse the documents retrieved by the queries. Section 'Observations and Discussions' gives the learnings drawn from the analysis. Section 'Limitations of Current Work' describes the drawbacks of the study. Section 'Conclusion' concludes the analysis, and Section 'Future directions' summarizes the future work.

LITERATURE REVIEW
Extensive previous literature has investigated the growth of hostile speech towards different communities and groups of victims belonging to various castes, religions, sexual orientations, etc. Over the last ten years, research on minimizing the presence of online hate has grown along an exponential curve. This serves as a source of inspiration for our work and helps us relate our study to the currently accessible research. Much work has been done on narrow subject areas of abusive and hostile speech targeted at minority communities and women. Earlier work has intensively examined the automated detection of offensive internet discourse under an assortment of names, for instance: abusive language, profanity, threats, and socially unacceptable discourse. Many legal issues have revolved around this area of work because of differing perceptions of "Hate Speech" as a legal term. The literature associated with hate speech can be divided into the four categories below.

Legal Assessment
The political discussion about a suitable answer to the ever-increasing amount of hate speech on social media has led to a growing desire to standardize and, even more, to automatically identify undesired postings. The legitimate notion of incitement to hatred answers this question by avoiding discrimination against and isolation of a target group, thereby guaranteeing the members' acceptance as equals in a society, likewise a precondition for democracy. 'Hate speech' is not a legal phrase; the law actually applicable to this occurrence is, by contrast, particular to each jurisdiction and well-defined. Thus, if the relevant social media post can disrupt public peace since it targets a group, it targets a group established in some country. The need to operationalize this task as an NLP task is crucial. [7] A decision tree is suitable for data annotation, along with directions for amateur annotators. Finally, there is an analysis of how the labels transferred from the decision tree and their annotation can be operationalized as NLP assignments. [8] The subtasks of target group recognition and targeting act recognition can be considered necessary while being annotatable with adequate dependability by non-legally trained individuals.
Their findings suggest that it is possible to technically implement this legal task of hate speech detection as an automated classification task. [9,10]

Hate Speech against Women
Patriarchal behaviour and other social practices have been carried over to the internet, manifesting as misogynistic and sexist remarks, postings, and tweets. This online hate speech against women has severe consequences in real life. Several legal actions have lately been filed against social networks that fail to prevent the propagation of hate comments targeting individuals. [11] Ground-breaking investigation into online hate speech directed towards women, focusing on the distinctions and parallels between misogyny and sexism, has started to surface with emerging technology and research interest directed towards this cause. Discrimination against women seems to be an indicator of a negative attitude towards women.
Experiments have shown that general sexist attitudes conceal a hateful sentiment and, in particular, a misogynistic mindset. Even though sexist humour is usually thought to be guilt-free, numerous studies show otherwise. For example, Frenda S, et al. emphasize that sexist jokes are perceived as misogynistic assertions. [13] Furthermore, sexist jokes may contribute to the normalization of sexism or misogyny while also harming the target. Hate speech identification was performed using a variety of supervised techniques based on word embeddings. Researchers compared the differences between racist and sexist datasets. They discovered that sexist tweets are more participatory and attitudinal than racist tweets. [14] In this challenging environment, an NLP-based technique can detect the two aspects of patriarchal behaviour, misogyny and sexism. Analysing the data and finding good results demonstrates that sexist and misogynistic sentiments are expressions of the patriarchal mentality.

Hate against Multilingual Communities
Social media networks have evolved into a forum where users are free to express their thoughts and feelings, which can lead to an increase in hateful or abusive communications that must be moderated. Most of the research work takes English as the primary language. [15] Detection of profanity in other languages is still a growing area of research. Given the diaspora worldwide, researchers have gained interest in exploring hate speech in various languages.
From a multilingual standpoint, supervised techniques for hate speech identification receive the most attention. Several models have been developed, ranging from feature engineering to neural techniques.
Hate speech encourages prejudice against specific groups and hinders equality, an ongoing problem in every civil society.
Immigrants and women are two groups that are disproportionately targeted. [2,13] Several governments and policymakers are currently attempting to address the issue of immigrants, which has been exacerbated by the refugee crisis and the political changes of recent years, making the development of tools for detecting and monitoring such hate particularly interesting. Furthermore, the work employs a bilingual approach, with data for two widely spoken languages, English and Spanish, available for training and testing participant systems. The diversity of hate targets and languages creates a unique comparative context regarding the amount of data collected and annotated using the same scheme and the outcomes obtained by participants training their systems on those data. Such a comparative setting may help shed fresh light on linguistic and communicative behaviour concerning these targets, allowing hate speech recognition software to be more easily integrated into a variety of applications. Experiments in a variety of languages have yielded very encouraging results.

Downgrading Racial Bias
Detecting hate speech, along with repressive and abusive language on social media platforms, is part of the current effort, which employs complex algorithms to identify racist or violent speech faster and more accurately without human assistance. On the other hand, machine learning models are prone to inferring human-like biases from the training data used by these algorithms.
There is a strong link between annotators' assessments of toxicity and signals of African American English (AAE) in contemporary hate speech datasets. Existing automatic detection models overlook an essential factor: context. [16] Hate speech classifiers are particularly sensitive to group identifiers such as "transgender," "black," and "gay," which indicate hate speech only in some cases. Because of this bias in annotated training data and the tendency of machine learning models to amplify it, AAE text is frequently mislabelled as abusive, offensive, or hate speech by existing hate speech classifiers, with a high false-positive rate. [17] Even when there is annotation bias in the underlying training data, an adversarial strategy can limit the potential for racial bias in hate speech classifiers. When training a classifier to predict a target attribute (toxicity), adversarial training can be used to suppress a protected attribute (AAE dialect).

Necessity of bibliometric analysis
The bibliometric study helps cover the majority of scientific results. It helps analyse published or evaluated articles and uses citation analysis to look at how those articles influenced subsequent research by others. As a result, this bibliometric evaluation will provide quantitative insights to upcoming researchers in the field of hate speech detection. Hate speech detection is also now more effectively possible using Machine Learning and Deep Learning models. Bibliometric analysis is a useful way to identify current trends, understand what has been accomplished in hate speech detection on online platforms, and analyse other literature to optimize the delivery process. The reason for the drastic increase in online hate speech is the widespread usage of social media, which is a powerful instrument for disseminating hate and abusive language across all digital channels and platforms. Hate speech is becoming a topic of research in various languages, needing focus to successfully analyse, detect, and neutralize the hostile impacts of propaganda. [18]

METHODOLOGY AND DATA
The Scopus publication database served as our data source for this study. Scopus is a peer-reviewed database of research publications in science, engineering, the arts, social sciences, medicine, technology, and the humanities. Based on detailed bibliometric analysis on the two datasets obtained, the composition of information and the progress of research on the subject of Hate Speech on social media is examined.
The preliminary data collection component of this study is organized as follows: the first section discusses the preliminary data collection procedure and the search technique utilized for data extraction.
The results of bibliometric analysis and data visualization approaches are presented in the following sections. The report finishes with findings, limits, and recommendations for future research.

Preliminary Data Collection
One of the leading databases for abstracts and citations is the Scopus database by Elsevier. Since 2004, Scopus has been home to well-written, trustworthy, peer-reviewed, state-of-the-art research papers that achieve a high level of citations. Scopus has approximately 36,378 titles from roughly 11,677 publishers, of which 34,345 are peer-reviewed journals in top-ranking subject disciplines. It is also developing as a platform that brings researchers, research concepts, and associations together. The data resource for our research work is the standard and reliable Scopus database.
Our search procedure was broadly split into three stages: data compilation, data mining, and data evaluation and visualization. The search period was set from 2015 to 2021. For this study, the visualization tools used were Scopus and Bibliometrix, an R package used to interpret the information obtained from Scopus.

Creating the keyword search queries
The main objective of our bibliometric analysis is to map out patterns and trends in the hate speech detection literature published so far. A preliminary search was conducted using keywords prominent in the NLP, hostile speech, and Machine Learning paradigms. [1] Keywords are crucial for finding appropriate research subjects in the existing literature. Specific and precise keywords give a clear-cut picture of how the topic occurs in relation to our research problem. For our research work, "Hate Speech," "Machine Learning," "Deep Learning," "Social Media," etc., were used. As shown in Table 1, a total of 6 search queries and their outcomes are summarised. In Table 1, search query number 3 returned 268 related papers from the Scopus database. Fifty-three related documents were extracted by search string number 1, which narrows the research area to the language domain "Hindi." As is clear from the results in Table 1, not much study has been performed on hate speech detection in Hindi as a natural language, for reasons such as the lack of datasets to train and test machine learning models and the lack of scholarly interest in Hindi. [19,20]

Preliminary search outcomes using search queries
Via the proposed keyword search, data collection was achieved using query strings, which retrieved 268 papers for the first query search and 53 documents for the second query search. The primary findings are summarised in Table 1. Table 2 shows the selected queries and the document results from the Scopus database. Table 3 presents the top ten most relevant affiliations; Cardiff University is at the top with 11 articles. Table 4 and Figure 2 may be referred to for the same. Figure 3 illustrates the subject areas and the number of documents per subject. Related papers came primarily from the subject area 'Computer Science' with 244 documents, followed by 'Engineering' with 69 documents. Data visualization tools are used to make data easier to interpret and to read trends, patterns, and outliers. [21,22]

Table 3. Top ten most relevant affiliations
Affiliation                                      Articles
Cardiff University                               11
Aristotle University of Thessaloniki             7
King Saud University                             7
The University of Jordan                         7
Dublin Institute of Technology                   6
Évry                                             6
Georgetown University                            6
Kempelen Institute of Intelligent Technologies   6
Taibah University                                6
University of Central Florida                    6

Selection of search strings
For our bibliometric analysis, we chose two search strings.
The first query is as follows: ("Hate Speech" OR "Hostile Speech" OR "Abusive") AND ("NLP" OR "Machine Learning" OR "Deep Learning" OR "Neural" OR "LSTM") AND "Social Media". The first search string has primary keywords such as "Hate Speech", "Machine Learning", and "Social Media" and secondary keywords such as "Hostile Speech", "Abusive", "NLP", "Deep Learning", "Neural", and "LSTM". This search string was chosen for further bibliometric analysis to gain insight into addressing and removing hate speech on social media sites using techniques such as machine learning.
The second query is as follows: "Hate Speech" AND "Hindi". The second search string has the primary keywords "Hate Speech" and "Hindi". This search string retrieves research publications that treat Hindi as the natural language for hate speech detection. Current research reveals that English is one of the predominant languages analysed from the point of view of cyberbullying and online hate speech. [14,22] However, there are too few widely available datasets in other languages to accelerate the growth of research in this field. Other communities might benefit from the removal of hate in their native language. Recent research challenges include producing large, trustworthy datasets in other languages (such as Hindi), because online hate is a prevalent problem.
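As a minimal sketch of how the two Boolean search strings can be assembled from their keyword groups, the Python snippet below joins each group with OR and combines the groups with AND. The helper names `or_group` and `build_query` are hypothetical and introduced only for illustration. The paper does not state which Scopus fields (title, abstract, keywords) were searched, so no field restriction is applied here.

```python
def or_group(terms):
    """Join quoted phrases with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """AND the groups together; a single-term group becomes a bare quoted phrase."""
    parts = []
    for terms in groups:
        if len(terms) == 1:
            parts.append(f'"{terms[0]}"')
        else:
            parts.append(or_group(terms))
    return " AND ".join(parts)


# Keyword groups taken from the two search strings described in the text.
first_query = build_query(
    ["Hate Speech", "Hostile Speech", "Abusive"],
    ["NLP", "Machine Learning", "Deep Learning", "Neural", "LSTM"],
    ["Social Media"],
)
second_query = build_query(["Hate Speech"], ["Hindi"])

print(first_query)
print(second_query)
```

Keeping the keyword groups as lists makes it straightforward to add or drop a secondary keyword and regenerate the string consistently.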
Some overlapping results may be present for the queries executed on the Scopus dashboard.
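One way to quantify such overlap, sketched here under the assumption that every exported record carries a unique identifier (for example a DOI or a Scopus record ID), is a simple set intersection. The function name `query_overlap` and the identifiers in the toy data are hypothetical and are not taken from the study's datasets.

```python
def query_overlap(first_ids, second_ids):
    """Return the identifiers shared by both result sets and the deduplicated total."""
    first, second = set(first_ids), set(second_ids)
    shared = first & second          # records retrieved by both queries
    total_unique = len(first | second)
    return shared, total_unique


# Hypothetical identifiers, for illustration only.
first_results = ["id-001", "id-002", "id-003", "id-004"]
second_results = ["id-003", "id-004", "id-005"]

shared, total_unique = query_overlap(first_results, second_results)
print(sorted(shared), total_unique)
```

Reporting the deduplicated total alongside the per-query counts makes it clear how many distinct documents the two strings cover together.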

Bibliometric Analysis of first search string
In this systematic bibliometric study, analyses were conducted to identify year-wise trends, quantitatively analyse the literature, define its scope, and provide a basis for better collaboration and exchange of ideas among the research community. Keywords, collaboration, sources of publication, research interest over the years, and co-citation of works were analysed.
The bibliometric analysis is further sectioned into 1. Statistical Analysis 2. Author, Keyword, and Journal Analysis 3. Citation, Document, and Location Analysis

Citation, Document, and Location Analysis
Clustering and Co-occurrences
Figure 1 shows clusters and associations of co-occurrence between author keywords collected from the Scopus database. Out of 512 keywords, 27 met the threshold of a minimum of five occurrences of a keyword. A total of 4 clusters were formed, depicted in 4 different colours, with the prominent keywords being "Hate Speech," "Machine Learning," "Deep Learning," and "Sentiment Analysis." The curved connecting lines show researchers' interest in topics related to these keywords. Node size reflects the tally of each author keyword.
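The thresholding and pair counting behind a keyword co-occurrence map can be illustrated with a short Python sketch. This is not the VOSviewer implementation; it is a simplified illustration that assumes each document is represented as a list of author keywords, counts a keyword at most once per document, and ignores clustering. The function name `cooccurrence` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, min_occurrences=5):
    """Count keyword pairs among keywords that meet the occurrence threshold."""
    # Normalise case and count each keyword at most once per document.
    cleaned = [sorted({kw.strip().lower() for kw in doc if kw.strip()}) for doc in docs]
    occurrences = Counter(kw for doc in cleaned for kw in doc)
    kept = {kw for kw, n in occurrences.items() if n >= min_occurrences}
    pairs = Counter()
    for doc in cleaned:
        selected = [kw for kw in doc if kw in kept]
        # Sorted input means each pair is always stored in the same order.
        pairs.update(combinations(selected, 2))
    return occurrences, kept, pairs


# Toy data for illustration only; the study applied a threshold of five.
sample_docs = [
    ["Hate Speech", "Machine Learning"],
    ["hate speech", "Deep Learning"],
    ["Hate Speech", "Machine Learning", "Twitter"],
]
occurrences, kept, pairs = cooccurrence(sample_docs, min_occurrences=2)
print(kept)
print(pairs)
```

In a map of this kind, the occurrence count would drive node size and the pair count would drive link strength between two keywords.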
Co-citation means that papers or authors are cited together. For the author co-citation analysis in Figure 2, 154 of 7792 authors met the threshold, with the minimum number of citations of an author set to 20. Figure 4 shows the citations according to the source. As observed, 152 documents fit the threshold, out of 152 sources, with a threshold on the minimum number of documents of a source.

Author, Keyword and Journal Analysis
Figure 9 shows the top keywords in each year from 2013 to 2020. It is observed that the counts of the keywords "Hate Speech" and "Machine Learning" have increased in recent years. Figure 10 shows that hate speech detection research is presented regularly at conferences. A few journals, such as IEEE Access, IEEE Internet Computing, and the International Journal of Scientific and Technology Research, have recently published hate speech detection research.
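A per-year ranking of keywords, of the kind a top-keywords-by-year figure summarises, can be sketched as follows. The snippet assumes each record is a (year, keyword list) pair; the function name `top_keywords_by_year` and the toy records are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def top_keywords_by_year(records, n=2):
    """Return the n most frequent keywords for each publication year."""
    counts = defaultdict(Counter)
    for year, keywords in records:
        # Normalise case and count a keyword once per document.
        counts[year].update({kw.strip().lower() for kw in keywords})
    result = {}
    for year in sorted(counts):
        # Sort by descending count, then alphabetically, so ties are deterministic.
        ranked = sorted(counts[year].items(), key=lambda kv: (-kv[1], kv[0]))
        result[year] = [kw for kw, _ in ranked[:n]]
    return result


# Toy records for illustration only.
records = [
    (2019, ["Hate Speech", "Machine Learning"]),
    (2020, ["Hate Speech", "Deep Learning"]),
    (2020, ["hate speech", "Machine Learning"]),
    (2020, ["Hate Speech", "Deep Learning"]),
]
trend = top_keywords_by_year(records, n=2)
print(trend)
```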
Observing the citations of authors, Vishwakarma DK has the largest number of locally cited documents, with a total of 40 citations, as shown in Figure 11.

Statistical Analysis
In the literature published from 2013 to 2021, work on hate speech spanned 19 subject areas. Computer Science has the highest number of documents (244). Since most hate speech is prevalent in the digital world, computer scientists and researchers are working on various tools to recognize and eliminate such content. Engineering has 69 documents of such research work. Figure 15 shows the distribution of subject areas in this research area.
The cited references per year indicate the significance of research in this field. Figure 13 shows a quick rise in citations from the year 2015. This reflects the curiosity of researchers about the analysis of hate speech. As said earlier, it can be attributed to the spread of online hostile and hate speech and to the development of machine learning. In 2018, a peak was observed, with 1129 references cited in publications. Figure 14 shows a rapid increase in this research area, with many papers published. We noticed that the number of documents and conference articles in this research area has grown exponentially in recent times. With the increase in user activity and the rise of diverse social media platforms, hate speech has become an additional concern. The quantity of papers has been escalating exponentially over the last 4 years. As Figure 14 shows, the maximum number of articles was published in 2020.
having many studies revolving around Hostile Speech detection, NLP, and Machine Learning topics.

Bibliometric Analysis of second search string
In this systematic bibliometric analysis of the second string, analyses were conducted to discover year-wise trends and to define the scope and subject areas that merit deeper study. An evaluation of the most commonly used keywords, global collaboration of authors, and recognition of various sources of publications, research interest over time, and co-citation was carried out in order to learn more about this research gap, motivating many more researchers to enter this field and contribute to it.
The bibliometric analysis is further sectioned into 1. Statistical Analysis 2. Author, Keyword, and Journal Analysis

Statistical Analysis
Hate speech detection has not been explored much in the Hindi language. An insufficient number of articles and documents have been published in this particular area of hate speech research. As evident from Figure 19, research with "Hindi" as the detection language has recently picked up pace. The year 2020 saw the highest number of publications, 27. Working on this research gap would be a fruitful contribution. Figure 20 shows the gradual increase from 2018 to 2020 in this field of work, with "Hindi" being the natural language of detection for hate speech. With the rise in the "Machine Learning" and "Deep Learning" techniques, it has been easy
51 documents belong to the "Computer Science" subject area, followed by "Engineering" and "Arts and Humanities", as the figure clearly indicates.
In recent years, machine learning, deep learning, and natural language processing (NLP) techniques have attracted academics' attention to detection problems, with the aim of achieving more efficient results with fewer downsides. Hate speech detection in Hindi remains a research gap due to the lack of available datasets. The thematic map in Figure 23 reveals that approaches like "AI Approach" and "Sentiment analysis" lie in the niche themes for solving the problem of online hate. Researchers are also starting to work on code-mixed datasets, given the increasing use of code-mixed style on online social media platforms. Figure 12 and Figure 19 indicate increased year-wise growth in publications in this field.
The word dynamics graph in Figure 17 illustrates the use of machine learning, speech detection, deep learning, and long short-term memory in a small number of studies, indicating a possible research gap that fellow researchers can explore to investigate different ways to detect and dissolve online hate with technologies like natural language processing and machine learning. Conferences are the favoured publication venue for researchers working in this domain. India and the USA lead in hate speech research, as is evident from the concentrated areas in the world map presented in Figure 18. From the second search string, we can conclude that not much study exists on Hindi. However, a growing trend is observed in the statistical analysis in Figure 19, with 2020 having the highest count, at only 27 publications. There was a steep rise between 2018 and 2020, which is a positive sign for growing research in this gap.

Limitations of current work
According to the findings of this study, knowledge on incitement to online hatred is growing at an accelerating rate in the computer engineering, decision sciences, and medicine sectors, and it is currently one of the most active and fastest-growing areas of research activity in the social sciences as well as the technology domain. The quantity of articles in 2020 demonstrates that there is a great deal of interest in the issue among social critics and scholars. The researchers of previous articles are certain that the January 2021 storming of Capitol Hill, as well as the subsequent social media bans and termination of such profiles, will spark more curiosity and study in this field. It should be mentioned that both developed and developing countries are actively conducting research on this issue. India, a country rich in variety and home to a plethora of different religions, is increasingly concerned about strategies to combat hate speech.
Hate speech analysis literature focuses on numerous aspects, such as hate speech directed at diverse groups in society, including cultural minorities, women, and people of various sexual orientations. At this moment, the need for more advanced tools and strategies to solve this issue is critical.
In more published findings, scientists have concentrated on identifying posts that express a disparaging view about a recognised personality of the country. Intentionally posting anything with hostile intent against an individual has far-reaching
authors in the area. The authors with affiliations to Cardiff University, Aristotle University of Thessaloniki, and King Saud University contributed the majority of the publications. The keyword analysis points to social media studies of hate speech aimed at exposing issues such as prejudice against women, racial bigotry, and hatred directed at multilingual populations.
The goal of this work was to use bibliometric analysis to show the current state of hate speech detection research on social media. Our findings revealed that such studies are becoming more refined and are being featured in top publications, as well as in cross-disciplinary journals involving linguistics, politics, and education, among other disciplines. Among the social media channels studied, Twitter came out on top as the most popular platform, and English was the most commonly used language in social networking research.
As social media has become an increasingly important part of our lives, research into the analysis and detection of profanity appears to be more important than ever. The novelty of this work is that it presents a thorough bibliometric analysis of the topic. Bibliometric tools such as VOSviewer and Biblioshiny are used to create a mind map, co-occurrence and co-citation maps, a Sankey plot, and a world cooperation map.

Future Directions
This paper aims to provide a clear understanding of the characteristics and research potential of detecting hate speech on social media. The analysis summarizes prior work for academics interested in contributing to this field of research. It is intended to guide other studies and research projects by using all available analysis and data to point to future trends. Based on a thorough examination, we believe that research on hate speech detection on social media will be fruitful in computer science. We anticipate that additional detailed studies on hate speech identification in Hindi on online social media platforms will be conducted.
Comparing different topic models for identifying the dominant themes and issues in a particular research subject might be noteworthy and valuable in future work.