1 Introduction

The rise of Social Media platforms has strengthened researchers’ interest in studying human behavior in different contexts, as these platforms make it possible to crawl real-time user data as well as data stored or published over long periods of time (Bayerl et al. 2014). Since most of the content published on the Internet is in text format, it is unsurprising that one of the most frequently used approaches for online pattern extraction comes from natural language processing (NLP). This area of artificial intelligence uses a set of computational methods to make human language accessible to computers and, more specifically, to give computers the ability to understand and generate human language (Eisenstein 2019; Indurkhya and Damerau 2010). NLP techniques are used in both academia and industry for text analysis in domains such as medicine (Wang et al. 2018; Savova et al. 2019; Tiwari et al. 2020), mental health (Calvo et al. 2017; Stewart and Velupillai 2021), economics (Fisher et al. 2016) or crime prevention (Schmidt and Wiegand 2017).

One of the areas that has benefited from NLP techniques in recent years is the study of extremist discourse, particularly due to the increasing use of Social Media by different extremist groups. Social Media platforms, such as Twitter or Facebook, have changed the way extremists communicate, recruit and disseminate their ideas (Dean et al. 2012). The rise of groups such as Islamic State or the Alt-right, together with their use of these platforms for different purposes (Jawhar 2016; Aliapoulios et al. 2021), has represented a threat for many countries, especially considering that extremism can facilitate the justification of violent actions to achieve a movement’s agenda (Thomas 2012). This threat led different countries to finance research projects and other initiatives that study the traces extremist users leave online, with the aim of identifying early behaviors and stopping these users before they embrace violent extremism. In fact, during the worst days of the jihadist threat (between 2015 and 2018), the European Union invested in several research projects grounded in NLP to track terrorism and online extremism (Bouzar 2018; Fernandez and Alani 2018; Florea et al. 2019; Torregrosa and Panizo 2018). Most of these initiatives aim to counter this phenomenon by detecting and classifying extremist content that could lead people to adopt these ideologies. Machine learning (ML) techniques have made a great contribution to this purpose (see, for example, Scanlon and Gerber 2014).

After this fruitful period of research on the extremism phenomenon from different perspectives, a few systematic surveys have addressed the specific relationship between NLP and extremism research. These systematic reviews can be divided into two types. The first type has analyzed NLP contributions to areas conceptually related to extremism, such as hate speech (Fortuna and Nunes 2018) or law enforcement (Edwards et al. 2015). The second type focuses on extremism itself, including NLP as a key part of its identification (Aldera et al. 2021; Gaikwad et al. 2021).

The limitation of the first approach is quite obvious, as the phenomenon is not studied directly. For example, while studying hate speech has a direct impact on the knowledge about extremism (as hate speech is used by extremists), the latter is a more complex phenomenon, composed of other discursive characteristics. The reviews under the second approach, while they do address the issue directly, present two limitations. On the one hand, their content is restricted to the specific task of detection and does not cover the rest of the data mining process (Gaikwad et al. 2021). On the other hand, they lack depth when studying the NLP approaches under focus (Aldera et al. 2021), failing to provide a thorough description of the diverse spectrum of techniques used in both descriptive and detection processes.

This article aims to cover the gap left by this prior work and other similar surveys through several contributions. First, it helps the audience conceptualize extremism and extremist discourse, and describes the concepts that are related to, or are a core part of, these fields. Second, it places an emphasis on NLP contributions to extremism analysis (including both description and classification/detection tasks), with a more comprehensive and critical approach to the different types of NLP techniques used to date. Third, it lists several software resources that can be helpful for future research in this area. Finally, it discusses future trends and challenges that need to be addressed in extremism analysis. To this end, a systematic review is conducted to collect and critically analyze the literature on NLP applied to the study of extremism. Five research questions are formulated to organize the contributions of this review:

  • RQ1. What are the current topics and contributions from NLP to extremism research?

  • RQ2. What NLP techniques are used in extremism research?

  • RQ3. How have NLP techniques been applied in the field of extremism research?

  • RQ4. What NLP software tools are commonly used in extremism research?

  • RQ5. Which publicly available datasets/data sources have been used to conduct NLP experiments in extremism research?

Fig. 1 Overall structure of the review. Blue: theoretical conceptualization; yellow: literature analysis; green: tools; pink: prospective analysis

Derived from these research questions and the process of answering them, the main contributions of the article can be summarized as follows:

  1. It provides a general picture of the theoretical foundations behind the concept of “extremism”, discussing its differences and similarities with other concepts that are often confused with it or misused as synonyms in the literature.

  2. It briefly defines the concept of extremist discourse, including some key elements that are present in this type of discourse.

  3. It presents an updated picture of the NLP techniques (including preprocessing techniques) used in extremism research, together with an analysis and comparison of their advantages and downsides.

  4. It summarizes the different applications that NLP techniques can have in extremism research, such as discourse description and classification. The main ML algorithms used to identify extremist content are also highlighted.

  5. It presents different available software tools, together with open datasets and data sources regarding extremism, which can be of great help for authors interested in conducting experiments or making advances in this field in the future.

  6. It highlights trends, challenges and research directions that can be pursued in this field, supporting them with the conclusions drawn from the analysis.

A summary of the structure of the rest of the overview can be seen in Fig. 1. Section 2 defines the concept of extremism, the differences between extremism and other topics and the distinctive features of extremist discourse. Section 3 explains how the review was planned and conducted, including the criteria adopted for the inclusion and exclusion of literature, and a brief summary of the process. Section 4 performs a general descriptive analysis of the outcomes of the search conducted, including the trends of publication and the main keywords associated with the articles. Section 5 describes and compares the different NLP techniques used by the authors. Section 6 focuses on the applications of these techniques, dividing them into two approaches: text description and text classification, including the ML algorithms used for this task. Section 7 lists open NLP datasets, data sources and tools used by the authors. Finally, Sect. 8 answers the research questions, presents future trends, challenges and directions of the area, and draws final conclusions.

2 Contextualizing the concept of extremist discourse

The definition of extremism has traditionally led to misconceptions in the literature, especially among authors with little background in the social sciences. This section deals with the different definitions around this topic. Section 2.1 analyzes the differences between extremism and radicalization, two concepts that are frequently used interchangeably (Schmid 2013). Section 2.2 briefly presents other concepts related to extremism, including their definitions and their relationships with it. Finally, Sect. 2.3 presents how the concept of extremism will be used in this article, including an operationalization of extremist language that will act as a framework against which the different articles reviewed can be compared.

2.1 Extremism and radicalization: differences and similarities

The literature so far shows that extremism and radicalization are often used as synonyms or interchangeable terms, which engenders the false idea that both terms mean the same. However, while authors do not usually distinguish between them from a methodological perspective, there are theoretical differences that make the two terms conceptually distinct. There is in fact no academic consensus about the definitions of extremism and radicalization (van de Weert and Eijkman 2019). Nevertheless, the different perspectives concerning their relationship can be summarized in three main approaches:

  1. Both concepts are synonyms: This could be related to the use of both terms in political discourse, which has transformed them into pejorative concepts that are used interchangeably (Schmid 2013).

  2. Both concepts are different, but one of them subsumes the other: In this line, several articles use the concept of radicalization to refer to the psychological process that precedes involvement in terrorism and extremism (Schuurman and Taylor 2018).

  3. Both concepts are different, without a necessary relationship between them: Regarding this approach, Bötticher (2017) conducted a deep analysis of the historical roots of these concepts, in an informed attempt at defining the differences between them. Essentially, the term radicalization was born during the 18th century as a way to define a movement against the establishment, but not inherently violent or positioned against democratic values. Meanwhile, the concept of extremism refers to an anti-democratic movement, and stands against “all those who do not embrace its dogmatic recipe for a transformation of society”. Another reference to this article can be found in Schuurman and Taylor (2018), who highlight that radicalization, understood in its historical context, does not necessarily imply a negative connotation of “change” of the socio-political order, while extremism does.

For the present review, it is necessary to keep an open position towards the three approaches. Extremism is considered the core concept of this review, and it is therefore preferred over radicalization as the reference term, since all the social movements of interest for this article are, essentially, those against democratic values. However, because both terms are often conflated in the literature, both radicalization and extremism are used as keywords to search the databases during the article gathering process. Accordingly, we include articles from authors who consider both terms synonyms, as well as those who use one as part of the other.

2.2 Extremism and other related concepts

As with extremism and radicalization, there are other concepts whose use in the context of extremism research is currently confusing. While some of these terms are closely related, they do not share the same theoretical definition. Figure 2 shows the overlap between different concepts usually related to extremism, graphically representing different possibilities: actual overlap (blue), absence of relationship (purple), characteristics of extremist discourse (yellow) or partially related concepts (green).

Fig. 2 Graphic representation of the overlap between extremism and concepts usually mentioned in the same context. A deeper analysis can be found in Table 1

Table 1 explains the differences displayed in Fig. 2. This explanation includes each term’s definition, its difference from the concept of extremism and an example from the literature. Taking into account that the main characteristic used to classify a movement as extremist is that it goes against democratic values, we can find three different types of concepts related to extremism in this table. The first two terms (supremacism and sectarianism) are actually subtypes of extremism, since they are both ideological movements that aim to suppress or limit certain fundamental democratic values of other social groups. When these ideological movements against democratic values resort to violence to achieve their objectives, it can be said that they constitute a type of terrorism (third term in the table). Finally, the last three terms (polarization, fundamentalism and nationalism), although related to extremism, do not necessarily share its opposition to democratic values.

There are other concepts that, despite being apparently related to extremism, are just manifestations of the violence and discrimination underlying this concept. Some examples are hate speech (Olteanu et al. 2018), racism (Fuchs 2016) or stalking/cyber-stalking (Kruglanski et al. 2020). The creation of fake news (Spohr 2017; Bozarth and Budak 2020) and its relationship with extremism currently represents another rising problem that has attracted the attention of the research community.

Table 1 Concepts, definitions and distinction from extremism

2.3 Definition and operationalization of extremist discourse

Until now we have presented a distinction between the concepts of radicalization and extremism, choosing the latter as the key concept of this article. Extremism has also been compared to other concepts that tend to appear in related studies. As stated, this term can have different meanings depending on the approach considered by the author, which is why it is relevant to establish a clear definition as a starting point. In this review, our definition of extremism is “an ideological movement, contrary to the democratic and ethical values of a society, that uses different methods, including violence (physical or verbal) to achieve its objectives”.

Following this definition, a second step is to clarify what this article refers to as extremist discourse. While it could be conceived as “the use of language held by people when expressing their extremist views”, several authors have highlighted features that distinguish an extremist narrative from regular discourse. These features, derived from different works (Ashour 2010; Bennett Furlow and Goodall 2011; Fortuna and Nunes 2018; Sakki and Pettersson 2016; Torregrosa et al. 2020), can be summarized as follows:

  • Types of extremist narrative: there are several ways in which extremist narratives justify their vision and objectives. Ashour (2010) divided these narratives into five categories: political, historical, socio-psychological, instrumental and theological/moral:

    • Political: the discourse includes references to grievances from one or more groups towards other groups.

    • Historical: legitimization of the political grievance narratives through the use of historical examples and similes.

    • Socio-psychological: glorification of acts against the system, either violent or not.

    • Instrumental: justification of the violence and “self-defense” as a way towards reaching objectives.

    • Theological/moral: legitimization of actions or reactions against political grievance or social oppression through religion, morality and/or ethics.

  • Linguistic style: the narrative types mentioned previously build upon a specific vocabulary and style that help extremists structure their discourse. Several articles have revealed differences between the linguistic style of radical and extremist texts and that of a regular sample of texts (Cohen et al. 2014). For example, a higher use of first and third person plural pronouns, a more negative tone and the use of more words related to negative topics are common in these texts (Torregrosa et al. 2020).

  • Use of discursive resources such as hate speech, otherness or war narrative: extremist texts tend to use discursive resources to convey their actions and ideas towards others. Some of these techniques have been studied in depth, such as hate speech (Fortuna and Nunes 2018), otherness (Sakki and Pettersson 2016) or the use of war terminology to create “enemies” and to communicate a “call to action” to others (Bennett Furlow and Goodall 2011).

In this vein, both the definition and operationalization of extremist discourse have been stated. This type of discourse is characterized by the use of specific narratives, an aggressive and polarized linguistic style and several techniques oriented to justify a feeling of superiority or inferiority towards another group. Considering this, NLP techniques can be exploited to analyze texts and detect and describe useful insights in order to determine when a user is holding this type of discourse. The next sections of this review will elaborate on how the community has used NLP to analyze extremist discourse on Social Media, and the outcomes reported in the reviewed studies.

3 Methodology

This section describes the process carried out to conduct the survey of the articles that apply NLP to extremism research. This process was conducted through a systematic approach, retrieving all the articles from four scientific databases: Scopus, ScienceDirect, IEEE Xplore and Web of Science.

The screening and review of the articles followed these steps:

  1. Search in the databases. Concerning the thesaurus (search terms) used for the search, it was decided to use both the extremism and radicalization terms in the search queries. The reason for this decision was that, as stated before, it is quite common for authors to use these concepts as synonyms (Bötticher 2017; Schmid 2013). In addition, while the term “Natural Language Processing” was included, we also decided to extend the search with different subtopics, such as “Sentiment analysis”, “Topic detection” and “Semantic analysis”. Finally, due to the recent contributions from the field of deep learning to natural language processing (Young et al. 2018), the subtopic “Deep learning” was also added to the search. Therefore, the thesaurus finally included in the search process is:

    (“Natural Language Processing” OR “Sentiment Analysis” OR “Topic Detection” OR “Semantic Analysis” OR “Deep Learning”) AND (“Extremism” OR “Radicalization”)

    No bounded time span was selected when conducting the review, so the articles returned by the search can have been published in any year. The extraction was conducted in January 2021, resulting in 729 documents from the different databases. Table 2 shows the distribution of articles found per queried database. After deleting duplicates and filtering out non-scientific articles (e.g. indexes), 675 articles remained in the literature corpus.

  2. First screening: title, abstract and methodology. After the search process, an extensive screening of the articles was conducted, which consisted of checking the title, the abstract and the methodology to find out whether the retained articles met the inclusion criteria of the review. These criteria can be summarized as:

    (a) The documents must empirically apply NLP to extremism description or classification.

    (b) The analysis conducted in the documents must be quantitative.

    (c) The documents must clearly state the NLP techniques they use to conduct the analysis.

    (d) The documents must present a clear methodology, including all the scores and the process they followed to conduct the analysis.

    (e) The article must be written in English.

    After this general screening process, 70 documents were retained for further review.

  3. Second screening: article’s content. A second, more exhaustive review of these articles was performed, carefully reading the content of each document and excluding those that were confirmed not to meet the criteria presented above. After the second screening process, 6 additional articles were discarded. The remaining ones (a total of 64) were finally included in the review.

  4. Analysis of the selected articles and input extraction: the final step of the review process was the analysis and systematic comparison of the outcomes obtained from the different articles regarding extremism and NLP. The next sections condense and summarize the information obtained from this process.
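As a minimal sketch of the search and screening steps above, the Boolean query can be assembled from the two term lists and the reported article counts can be checked for consistency. The helper name `or_group` and the variable names are hypothetical, straight quotes replace the typographic ones, and each database may require its own query syntax, which this sketch does not attempt to reproduce.

```python
# Term lists and counts are taken from the text; all names are hypothetical.
nlp_terms = [
    "Natural Language Processing",
    "Sentiment Analysis",
    "Topic Detection",
    "Semantic Analysis",
    "Deep Learning",
]
extremism_terms = ["Extremism", "Radicalization"]


def or_group(terms):
    """Quote each term, join the terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Generic Boolean query: (NLP terms) AND (extremism terms).
query = or_group(nlp_terms) + " AND " + or_group(extremism_terms)

# Screening funnel reported in the text.
retrieved = 729                  # documents returned by the databases
after_cleaning = 675             # after removing duplicates and non-scientific items
after_first_screening = 70       # title, abstract and methodology check
discarded_second_screening = 6   # excluded after reading the full content
included = after_first_screening - discarded_second_screening

print(query)
print(included)
```

The funnel values only restate the figures given in the steps above; the subtraction simply confirms that the 64 included articles follow from the 70 retained and 6 discarded documents.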

Table 2 Articles extracted from the different databases that apply NLP to extremism research

4 General descriptive analysis of the articles

This section presents a general descriptive analysis of the articles finally included in the review. First, a general introduction reviews the publishing years and the types of extremism addressed. Then, to identify the most relevant NLP-related topics that the selected articles deal with, a textual analysis of their indexing keywords is performed. This description is also used to structure the following sections of the paper, as it gives a general picture of the main topics addressed by the reviewed contributions.

Analyzing the timeline of the reviewed publications and the type of extremism under analysis, it can be observed that the interest in applying NLP to study extremism has increased sharply in recent years. This is shown in Fig. 3, which in turn supports the rationale given in the introduction of this review: most articles were published during or after 2015, which overlaps with the period when ISIS was most active.

Fig. 3 Type of extremism addressed by the articles included in the survey

Besides, as shown in Fig. 3, the most frequently addressed type of extremism in the reviewed articles is jihadi extremism, well ahead of the other types. In general terms, five types of extremism are addressed in the literature: religious (all of them concerning jihadism), political (far-right), political mixed (concerning far right/far left), war (concerning conflicts in different countries, such as Afghanistan), and mixed (studying both religious and political extremism). Since 2015, the number of works that use NLP to identify extremism has substantially increased. In this last period, while jihadi extremism has attracted more interest, political extremism has remained relatively steady. Therefore, it can be concluded that the two predominant types of extremism have been the religious and political ones.

Fig. 4 Word cloud of keywords extracted from the analyzed articles

We continue the preliminary analysis by determining the most common topics associated with the thesaurus used in the search for the articles. To this end, a textual analysis of the keywords of the reviewed articles was performed. Fig. 4 depicts a word cloud with the 30 keywords most frequently used by the articles (keywords used as thesaurus terms were excluded from the count). As can be seen, the keywords can be grouped into four categories:

  1. The different NLP approaches in use (e.g. topic modeling, sentiment classification or semantics).

  2. The source of the analyzed data (e.g. Twitter, social media, YouTube, web pages, or Dabiq, a jihadi magazine), as well as specific tools that can be used (e.g. Linguistic Inquiry and Word Count).

  3. Different key terms related to extremism (e.g. terrorism, ISIS, far-right, extreme right, hate speech, online radicalization, or radicalism).

  4. The applied methodology, including classification techniques (ML, classification, logistic regression or feature engineering).

It is relevant to note at this point that, although it is not the objective of this survey, social network analysis (SNA) appeared among the most used keywords (both as social network analysis and as related variants, such as Social Networks and Network Analysis) (Bello-Orgaz et al. 2016; Camacho et al. 2020), which suggests that it is frequently applied alongside NLP analysis.

5 NLP techniques for extremism research

The main objective of NLP techniques is to transform free text into structured data by capturing its lexical, syntactic and semantic information to acquire or infer new knowledge. Considering this, the NLP process can be divided into two main phases:

  1. Text preprocessing: simplifying and preparing the text for its further analysis.

  2. Feature generation: transforming the text into a structured data representation suitable to be used by the different computational methods of analysis.
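The two phases can be illustrated with a minimal sketch that uses only the Python standard library. The function names, the tiny stop-word list and the example sentence are assumptions made for illustration; the reviewed articles rely on richer preprocessing and feature generation steps.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "are", "in"}


def preprocess(text):
    """Phase 1: lowercase the text, tokenize it and drop stop-words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def generate_features(tokens):
    """Phase 2: turn the tokens into structured data (token -> raw count)."""
    return Counter(tokens)


tokens = preprocess("The members of the group are the voice of the group.")
features = generate_features(tokens)

print(tokens)    # ['members', 'group', 'voice', 'group']
print(features)
```

Keeping both phases as separate functions mirrors the division described above: the preprocessing can be changed without touching how the features are generated.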

Text preprocessing techniques are well established in NLP, and therefore their review is out of the scope of this paper. Researchers interested in these techniques can find further details in Indurkhya and Damerau (2010) and Aggarwal (2018).

After preprocessing the textual data, different text mining techniques are used to transform tokens into structured data by capturing their lexical, syntactic and semantic information. These structured data can then be used as input for the different algorithms to acquire or infer new knowledge.

Table 3 Summary of NLP techniques for feature generation used in the reviewed literature

Table 3 presents all the techniques mentioned in the review, together with the reviewed articles that have applied them as part of their methodological approach. These techniques can be grouped into three categories according to the type of linguistic information captured, which are explained in detail in the following subsections. Each subsection first provides a descriptive analysis of the techniques. Afterwards, a comparative analysis of these techniques is carried out within the area of extremism research, stressing the advantages and disadvantages of each technique within this specific domain.

5.1 Lexical or vectorial based features

The tokens extracted in the preprocessing phase have to be transformed into more complex data structures that represent the final textual features to be further processed. For this purpose, different text representation techniques can be applied. The vector space model (VSM) (Turney and Pantel 2010) is one of the most widely used text representations in classical NLP approaches. The idea of the VSM is to represent each text or document as a point in a space (a vector in a vector space) based on the tokens extracted. After the tokenization process, the first step to generate this type of representation consists of defining the weighting technique used to compute the frequency of appearance of the tokens (terms) in a text. The reviewed articles mention several techniques to generate this vector representation:

  • N-grams: preprocessing the free text yields tokens of size 1, each of which represents a single word. However, sentences generally contain compound terms (such as living room or coffee machine) formed by several words with a single meaning. Grouping multiple tokens together to represent that meaning can be very beneficial for subsequent NLP tasks, and this is what n-gram models enable (Sidorov et al. 2012). A uni-gram is any single element of the text, whereas a bi-gram or a tri-gram is composed of two or three elements, respectively, which appear sequentially in the text. Skip-gram is a special version of n-gram: it works in the same way, but considers tokens that are not necessarily adjacent in the text. Therefore, an analysis based on n-grams considers n elements as a single token. One of the main advantages of this approach is that larger values of n provide contextual information for words (Fortuna and Nunes 2018). Table 4 summarizes which types of n-grams are used in the reviewed articles; unigrams are not shown since, as mentioned above, they are the 1-sized tokens already produced by preprocessing techniques.

  • Dictionaries: they are pre-established lexicons (lists of words or expressions) used for filtering or grouping the preprocessed tokens. Therefore, any term found inside the lexicon is considered a final token to generate the final text representation. Dictionaries can also aggregate the frequencies of all their terms into a single token, thus calculating the frequency of occurrence of the dictionary itself. The main advantage of dictionaries is that they capture concepts defined by different terms. By contrast, they are very vulnerable to words that are not previously included in the lexicon and to the continuous change of language.

  • Term frequency (TF): it is the most basic weighting technique in NLP, and consists of the raw count of the occurrences of each token found in the text. It can be represented as tf(t, d), which denotes the number of times a token t appears in document d.

  • Term frequency-inverse document frequency (TF-IDF): it is an evolution of the aforementioned TF. While TF just counts the occurrences of a token in a text, TF-IDF also weights this count by the inverse document frequency, that is, it down-weights tokens that occur in many documents of the corpus. When a word is more frequent in a text than in the corpus as a whole, it is considered relevant for that text and is therefore given a higher score. It is useful for discriminating between relevant words and words with no relevant meaning, such as stop-words (Fortuna and Nunes 2018).

  • Dichotomous appearance: it represents the presence or absence of a token. Therefore, it is computed as 0 if the term does not appear, and 1 if the term appears.

  • Log-likelihood (Dunning 1993): it is used to compute the significance of the co-occurrence of two variables (for example, two tokens or a token with the group used for classification). Therefore, this technique does not account for the frequency of a single token, but for the frequency of two conditions appearing together, which may include one or two tokens.
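The weighting schemes listed above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in any of the reviewed articles: the function names and the toy corpus are assumptions, and the IDF shown here is the plain logarithmic variant log(N / df), which is only one of several formulations in use.

```python
import math


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf(token, doc):
    """Term frequency: raw count of the token in the document."""
    return doc.count(token)


def idf(token, corpus):
    """Inverse document frequency, log(N / df); 0.0 if the token is unseen."""
    df = sum(1 for doc in corpus if token in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(token, doc, corpus):
    """TF weighted by IDF: high for tokens frequent here but rare elsewhere."""
    return tf(token, doc) * idf(token, corpus)


def binary(token, doc):
    """Dichotomous appearance: 1 if the token occurs in the document, else 0."""
    return 1 if token in doc else 0


# Toy corpus of three already tokenized documents (invented for illustration).
corpus = [
    ["we", "will", "defend", "our", "land"],
    ["we", "like", "coffee"],
    ["we", "defend", "free", "speech"],
]
doc = corpus[0]

print(ngrams(doc, 2))               # bi-grams of the first document
print(tf_idf("we", doc, corpus))    # 0.0, since "we" occurs in every document
print(tf_idf("land", doc, corpus))  # log(3), since "land" occurs in one document only
print(binary("coffee", doc))        # 0
```

In this toy corpus the token "we" receives a TF-IDF score of zero because it appears in every document, whereas "land" is scored higher because it is specific to the first one, which is the discriminating behavior described above.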

Table 4 Type of n-gram model used in the reviewed articles

Focusing on Table 3, the first point to be highlighted is the high use of n-grams and dictionary techniques, exceeding 25% in both cases. This is because the text preprocessing phase yields tokens of size 1 to represent the text. In many cases, before applying more complex techniques that transform such tokens into complex data structures, it is beneficial to apply basic NLP techniques that group or filter the tokens, aggregating them at a first level of lexical information.

The major advantage of the n-gram approach is that it is independent of the text: all terms can be vectorized with these techniques, regardless of whether they appear in a lexicon. This is especially useful when applying NLP to extremism research, as texts usually combine terms in different languages. However, this versatility also poses a handicap: the vectorized terms may have no relevant meaning for the researcher, and therefore extra work must be conducted in those cases to identify which terms are relevant.

On the other hand, the use of dictionaries is helpful to detect and classify tokens into meaningful psycho-linguistic categories (Fernandez et al. 2018; Figea et al. 2016). This is a great advantage in the field of extremism research, taking into consideration the psychological background that motivates extremist behavior. In fact, one of the main dictionary-based tools, Linguistic Inquiry and Word Count, or LIWC (Pennebaker et al. 2001), was created with the aim of conducting psychological research from texts. It has been frequently applied to extract psychological insights and extremist slang from extremist texts (Torregrosa et al. 2020). However, dictionaries require a previous effort from the researchers to prepare the lexicons or to adapt them to other languages (Sikos et al. 2014). This last point is especially relevant in the case of jihadi extremism, as texts usually combine Islamic terminology (written in Arabic) with different languages (Sikos et al. 2014).
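A minimal sketch of a dictionary-based feature extractor is shown next. The lexicon, its category names and its word lists are invented for illustration and are not taken from LIWC or from any reviewed article; the function name `dictionary_features` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: categories and word lists are invented, not LIWC's.
LEXICON = {
    "first_person_plural": {"we", "us", "our"},
    "third_person_plural": {"they", "them", "their"},
    "negative_tone": {"enemy", "threat", "betray"},
}


def dictionary_features(tokens, lexicon=LEXICON):
    """Return, for each category, the share of tokens that belong to it."""
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {category: counts[category] / total for category in lexicon}


# "xyzzy" stands for a slang term that is missing from the lexicon.
tokens = ["they", "are", "a", "threat", "to", "us", "and", "our", "xyzzy"]
features = dictionary_features(tokens)

print(features)
```

The unknown token "xyzzy" contributes to no category, which illustrates the vulnerability mentioned above: terms absent from the lexicon are silently ignored until the dictionary is updated.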

Continuing with the analysis of the vector space models applied in the reviewed articles, TF and TF-IDF are the most used techniques. As stated previously, TF-IDF is an evolution of TF that uses IDF to down-weight the terms that are common across the texts, giving more prominence to the less used terms, which can be relevant to discriminate textual patterns (in this case, extremist content). Since several of the reviewed articles apply filtering during preprocessing to eliminate irrelevant terms (such as stop-words), there is not a large difference between the two techniques in the extremism research field. Their main advantage is their simplicity and broad use, which make them the most commonly applied techniques. Unfortunately, they have a major disadvantage: they do not provide semantic information about the terms.
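The relationship between TF and TF-IDF can be sketched as follows. This is one common textbook variant (relative term frequency and an unsmoothed IDF of log(N/df)), chosen here as an assumption; the reviewed articles and existing libraries may use smoothed or otherwise different weightings.

```python
import math
from collections import Counter


def tf(doc):
    """Relative frequency of each term in a tokenized document."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}


def idf(docs):
    """Unsmoothed inverse document frequency: log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tf_idf(docs):
    """TF-IDF weights for every tokenized document in the corpus."""
    weights = idf(docs)
    return [{term: f * weights[term] for term, f in tf(doc).items()} for doc in docs]


# Toy corpus invented for illustration.
toy_docs = [["the", "caliphate"], ["the", "weather"]]
toy_weights = tf_idf(toy_docs)
```

In this toy corpus the term "the" appears in every document, so its IDF is log(2/2) = 0 and its TF-IDF weight vanishes, while the document-specific terms keep a positive weight. Neither representation says anything about what a term means.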

Dichotomous appearance was used in only one article. While it presents a clear advantage (it is straightforward to implement), it has one main disadvantage: as stated in the previous section, some terms are used with different semantic meanings in regular and extremist texts (Fernandez and Alani 2018; Gomes et al. 2017; Saif et al. 2016; Wei and Singh 2018), so analyzing only the occurrence of a term can provide little information to the model. Finally, log-likelihood can be used for analysing the association among terms, which provides more contextual information. However, it is still very rarely used in the field of extremism research. A brief summary of the advantages and disadvantages of all these techniques appears in Table 5.

Table 5 Comparison of vector space model based techniques to generate features in the reviewed articles

5.2 Neural language models (word embedding)

Techniques based on neural models include a set of methods that use neural networks to transform the tokens obtained from the preprocessing phase into meaningful vectors, which makes it possible to capture the relationships among them (Levy and Goldberg 2014) and, therefore, information about semantically related words. In recent years, the application of these models in the field of extremism research has gained increasing relevance, as they are useful for retaining information about the semantic meaning of the terms. This is precisely the advantage of this type of model for extracting textual features compared with the classical models seen in the previous section. This aspect is especially relevant when applied to classification tasks and the use of deep learning to identify extremist content (Johnston and Marku 2020; Johnston and Weiss 2017). The most common neural models identified throughout the reviewed literature are:

  • Word2Vec: it allows predicting words depending on the context, maintaining the semantic meaning of the sentence. To this end, the model creates a vector for each word through the use of a single-layer neural network, which can be interpreted as an embedding space. Words that are more likely to appear together in the text are mapped closer in that space, and therefore share semantic context (Mikolov et al. 2013). Among the different versions of this technique, the continuous Bag-of-Words model and the Skip-Gram model are arguably the most commonly used ones (Goldberg and Levy 2014; Rong 2014).

  • FastText: developed by Facebook (Bojanowski et al. 2017), it operates similarly to the Word2Vec skip-gram model while overcoming limitations of that model: it incorporates subwords in the embedding process, which makes it possible to include words that were not contained in the original lexicon (Schmitt et al. 2018).

  • GloVe: developed at Stanford University (Pennington et al. 2014), it relies on a word co-occurrence matrix, to which factorization techniques are applied in order to extract the vector associated with each word. While Word2Vec reportedly performs better than this technique, GloVe has the advantage of having more trained models available to work with (Mikolov et al. 2017).
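The inputs that two of these models learn from can be sketched without training any network. The snippet below is a minimal illustration, not the implementation of either library: `skipgram_pairs` builds the (center, context) pairs that a skip-gram model is trained on, and `char_ngrams` builds the character n-grams that a FastText-style model uses as subwords, following the boundary-marker convention described by Bojanowski et al. (2017). The function names and the default window size are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


pairs = skipgram_pairs(["a", "b", "c"], window=1)
subwords = char_ngrams("where", n_min=3, n_max=3)
```

Because an unseen word still shares character n-grams with known words, a subword-based model can assign it a vector, which is the property that makes this family of models attractive for slang and mixed-language texts.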

After analysing the usage of these approaches in the reviewed literature on extremism, four different purposes of word embedding methods can be discriminated:

  1. To conduct bias analysis (how pejorative terms are related to some entities and not to others) (Ottoni et al. 2018).

  2. To check how two texts use similar tokens but with different meanings (Gomes et al. 2017; Kursuncu et al. 2019).

  3. To create new lexicons based on an already checked text (Araque and Iglesias 2020; Nouh et al. 2019).

  4. To overcome language limitations on extremist detection (Johnston and Weiss 2017).

Regarding the frequency of use of these techniques in the field of extremism, Table 3 reveals that the most used technique (Word2Vec) does not reach 10%, a value much lower than that of most of the classical techniques based on vector space models. This is because this type of approach has gained importance only in the last few years, and its application to the field of extremism research is only now gaining momentum.

Only one article reported a comparison among FastText, Word2Vec and GloVe within an extremism classification task. FastText performed slightly better than the other two. However, Word2Vec and its variations, such as doc2vec (Lau and Baldwin 2016) or graph2vec (Narayanan et al. 2017), still stand out as the most frequently used pre-trained word embeddings. Table 6 summarizes the comparison among these techniques in the context of extremism research, including their advantages and disadvantages.

Table 6 Comparison of neural techniques to generate features used in the reviewed articles

5.3 Syntactic and semantic features

Some NLP techniques rely on the analysis of data according to a particular context for generating features representing the text (Krippendorff 2018). The type of contextual information depends on the NLP technique under consideration, but common approaches include sentiment analysis, topic detection or semantic analysis, among others. Techniques of this type used by the reviewed articles include:

  • Part-of-speech (POS): it allows tagging every word with its grammatical category (e.g. nouns, verbs or adjectives) depending on the structure of the text where it is found (Cutting et al. 1992).

  • Lexical syntactic feature-based (LSF): it allows capturing the dependence inside a sentence or a text between two terms (Benito Sánchez 2019). These two terms are later compared to determine the context and the direction of the expression.

  • Named entity recognition (NER): it deals with the identification of entities (e.g. names, organizations or locations) in the text, tagging them as relevant subjects (Ritter et al. 2011).

  • Parse trees (PT): it constructs a representation of how the concepts can be used recursively in a sentence. Parse trees include all the tokens and their relationships, along with a set of rules that allows substituting the token while maintaining the syntactic rules.

  • Latent Dirichlet allocation (LDA): it is one of the most popular NLP techniques for topic detection. It extracts topics from a corpus of text based on word probabilities: for each latent topic, it extracts the probability distribution of a combination of words, which helps identify the main topics (Jelodar et al. 2019).

  • Non-negative matrix factorization (NMF): it is a topic modeling technique which relies on the use of linear algebra algorithms in a TF-IDF document matrix to define topics (Chen et al. 2019).

  • Sentiment scoring (SS): it provides a score for every text unit (e.g. sentence or text) based on its latent emotional valence, with the aim of understanding the author's opinion or emotional state about something (Liu 2020). This score can be computed as dimensional (through a single score for the valence) or categorical (specifying which emotions are expressed in the text). Table 7 summarizes how both approaches are distributed among the reviewed articles.

  • Semantic tagging (ST): it implies the process of automatically extracting concepts, entities or topics from the tokens in a text, which can be realized by assorted algorithmic means (e.g. Jovanovic et al. 2014).

  • Word/sentence length: it analyzes the length of the words (based on characters) and/or the sentences (based on words) (Stankov et al. 2010; Yang et al. 2011; Sikos et al. 2014; Weir et al. 2016; Scrivens et al. 2018).

  • Use of emoticons: emoticons are graphical figures to express emotions or behaviors on the text, using a combination of characters (Agarwal and Sureka 2015; Wei et al. 2016).

  • Use of punctuation: this approach involves the analysis of the use of punctuation signs as part of the syntactic distribution of the sentence (Sikos et al. 2014; Yang et al. 2011).
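The last three items of the list (word/sentence length, emoticons and punctuation) are simple enough to be sketched directly. The regular expressions below are rough assumptions made for illustration: they cover only a few Western-style emoticons and split sentences on terminal punctuation, which is cruder than what the reviewed articles may have used.

```python
import re
import string

# Rough pattern for a few common emoticons such as :) ;-( =D (an assumption).
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def formatting_features(text):
    """Compute simple text formatting features for one text."""
    emoticons = EMOTICON_RE.findall(text)
    # Remove emoticons first so they are not counted as punctuation or words.
    cleaned = EMOTICON_RE.sub(" ", text)
    sentences = [s for s in re.split(r"[.!?]+", cleaned) if s.strip()]
    words = re.findall(r"[A-Za-z']+", cleaned)
    punctuation = [ch for ch in cleaned if ch in string.punctuation]
    return {
        "n_emoticons": len(emoticons),
        "n_punctuation": len(punctuation),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


features = formatting_features("We won! They lost :( badly.")
```

Features of this kind describe how a text is written rather than what it says, so they are naturally combined with token-based or semantic features instead of being used alone.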

Table 7 Type of sentiment analysis approaches used in the reviewed articles on extremism

These types of techniques go a step further into text representation, taking advantage of the tokens to conduct a more complex analysis. This is especially useful in the extremism research field, where simple term analysis or frequency can be misleading in the interpretation of outcomes, due to the disparity of semantic meaning behind the same term when used by extremist and non-extremist groups (Fernandez and Alani 2018).

The first four aforementioned techniques (namely, POS, NER, LSF and PT) are used to analyze, tag and extract information about the syntactic structure underlying the tokens. While POS tags each word with its syntactic function inside a sentence, NER is used to identify the nouns and entities present in the text. This information is then used to determine which nouns from the text are actual people, organizations or locations (Hartung et al. 2017; Saif et al. 2017, 2016; Fernandez and Alani 2018; Bisgin et al. 2019), among others. In particular, according to the articles reviewed in our survey, using a combination of noun semantic categories obtained through NER was statistically more accurate for determining whether a text included extremist content than using token analysis, sentiment or topic features (Saif et al. 2017, 2016). As for the frequency of application shown in Table 3, POS is the most commonly used of these four techniques in the field of extremism (25%), while the rest are used less frequently.

On the other hand, LSF and PT concern the syntax and the dependencies among tokens. In this case, LSF analyzes the relationship between two syntactically dependent tokens (Kim et al. 2017; Masood 2021), while parse trees build representations of several tokens and use their syntactic structure to find tokens combined in the same way (Sikos et al. 2014). LSF was compared to vector space models as a classification feature, but it did not perform any better than the latter (Hartung et al. 2017).

With regard to topic extraction, LDA and NMF have been the techniques of choice in many reviewed articles. LDA has the advantage of resting on a statistical foundation and of being commonly used in the NLP literature (Heidarysafa et al. 2020). However, as stated in Alizadeh et al. (2019), it performs poorly with short texts (e.g. tweets). Since most of the reviewed articles use Twitter to extract their extremist datasets, this is an important disadvantage. NMF emerges as an alternative to LDA, presenting more readily interpretable results (O’Callaghan et al. 2012, 2015) and a better performance over short texts (Chen et al. 2019). Notwithstanding these benefits, in the reviewed articles NMF is used much less frequently than LDA (see Table 3).
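To make the linear algebra behind NMF concrete, the sketch below factorizes a small document-term matrix V into two non-negative matrices W (documents by topics) and H (topics by terms). It uses the classical multiplicative update rules for the squared reconstruction error, which is an assumption: the reviewed articles do not state which solver they used, and the toy matrix is invented for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x k) and H (k x terms), both non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(k)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_terms)] for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n_docs)]
    return w, h


# Toy TF-IDF-like matrix: 4 documents x 4 terms, with two obvious term groups.
toy_v = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
w0, h0 = nmf(toy_v, k=2, n_iter=0)
w, h = nmf(toy_v, k=2)
```

Each row of H can be read as a topic (a set of non-negative term weights) and each row of W as the mixture of topics in a document, which is one reason why NMF results are often described as readily interpretable.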

Adding a “valence score” to a topic can help compose a representative idea of the author’s agreement with that topic (Bermingham et al. 2009; Scrivens et al. 2018). For example, two studies focused on the general Arabic population found that Twitter users’ tone was more negative when ISIS committed a murder, won a battle or made a public call or movement (Mirani and Sasi 2016; Ceron et al. 2019). Sentiment scoring techniques are divided into two different approaches: a dimensional approach, based on a single score, and a categorical approach, based on the classification of tokens into one or more emotions (such as anger, fear or happiness). A combination of both strategies can be found in some of the articles (Wignell et al. 2018; Figea et al. 2016). These techniques can be employed to measure the emotions expressed in the text, together with the opinion of the writer towards a specific token in the text (Bakshi et al. 2008). The main differences between them are their theoretical basis and the way they are applied: dimensional scoring usually involves selecting a token, around which the scoring process takes place, whereas categorical scoring usually classifies tokens depending on the emotion they represent, and is therefore more focused on single tokens. In the case of extremism research, both approaches can be useful, as they can identify how extremist texts approach different topics (Wignell et al. 2018; Macnair and Frank 2018), the valence of their tones (Wei et al. 2016) or the connotations of the terms they use (Chen 2008). Finally, the concept of semantic tagging was used in the reviewed literature to tag tokens with semantic information regarding their context. This strategy, which is very similar to NER (and sometimes uses it), tags the tokens with entities, but also with concepts and categories (Wignell et al. 2018).

Focusing on the use of this type of technique in the reviewed articles, Table 3 shows that sentiment analysis techniques are the most used among the techniques to extract syntactic and semantic features, exceeding 37% in the case of sentiment scoring.
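The difference between the dimensional and the categorical approach can be sketched with two toy lexicons. Both lexicons and their values are invented for illustration and do not correspond to any published sentiment resource; real tools also handle negation, intensity and context, which this sketch ignores.

```python
from collections import Counter

# Toy lexicons invented for illustration only.
TOY_VALENCE = {"victory": 1.0, "paradise": 1.0, "murder": -1.0, "fear": -0.5}
TOY_EMOTIONS = {"victory": "joy", "paradise": "joy", "murder": "anger", "fear": "fear"}


def dimensional_score(tokens, valence=TOY_VALENCE):
    """Single valence score: mean valence of the tokens found in the lexicon."""
    scored = [valence[t] for t in tokens if t in valence]
    return sum(scored) / len(scored) if scored else 0.0


def categorical_counts(tokens, emotions=TOY_EMOTIONS):
    """Number of tokens assigned to each emotion category."""
    return dict(Counter(emotions[t] for t in tokens if t in emotions))


toy_tokens = "fear of murder".split()
toy_score = dimensional_score(toy_tokens)
toy_emotions = categorical_counts(toy_tokens)
```

For this toy input the dimensional approach returns one negative number for the whole text, while the categorical approach reports which emotions contributed to it.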

The last three techniques analyze the formatting characteristics of the text in order to build other types of features that capture more information than that provided by the text itself. Examples include the length and quantity of texts, sentences or words, the number of characters inside a word, and the use of punctuation or emoticons. In all these cases, text characteristic features have been used as a complement to other text features, never as the only features extracted from the free text. However, they have shown little impact when describing or predicting extremism in texts, and in general are applied in a marginal fraction of the reviewed works (as can be seen in the last 3 rows of Table 3).

Table 8 presents a summary of all the techniques used to generate syntactic and semantic features showing their advantages and disadvantages both in general application and in extremism literature.

Table 8 Comparison of syntactic and semantic based techniques to generate features for text representation

6 Applications of NLP in extremism research

The previous section has detailed all NLP techniques used in the reviewed works on extremism to process text and generate features as structured data. Depending on the objectives to be achieved in each of the reviewed works, one or several of these generated features are used to acquire new knowledge. In general two main purposes have been identified in the reviewed papers for which they are used:

  1. As the input of classification models generated with ML algorithms to discriminate between extremist and non-extremist content.

  2. To conduct a descriptive analysis characterizing the extremism: for example, to detect slang that is specific to extremism.

Based on these two main approaches, the next subsections present a descriptive and comparative analysis of the works that undertake each of these purposes, pausing and examining their main outcomes.

6.1 Classification approaches

As can be derived from the general analysis of the reviewed articles presented in Sect. 4, classification is one of the main topics of interest regarding NLP applications on extremism. This is not surprising, as one of the key objectives of this research field is to help law enforcement agencies identify extremist content. More than half of the articles included in the review (54.68%) made use of one or more classification algorithms, especially during the first years of ISIS activity. As shown in Fig. 5, 2015 and 2018 were the only years after the beginning of ISIS activity in which more articles did not resort to classification techniques than used them. The common use of classification approaches shows that there was a higher interest in detecting extremism than in defining it.

With the goal of training classification models based on NLP features to discriminate between extremist and non-extremist content, different ML algorithms have been applied in the reviewed works. These works use ML models to address issues that range from sentiment analysis (using a pre-labelled dataset) to proper user classification (extremist vs non-extremist). Figure 6 illustrates the frequency of application of every ML algorithm in the articles under review, where it can be noted that support vector machine (SVM) is the most widely considered model, followed by random forest, Naïve Bayes and decision tree (J48).
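As an illustration of how token-based features feed one of these classifiers, the following sketch implements a multinomial Naïve Bayes model with add-one smoothing over bag-of-words counts. It is a minimal example under stated assumptions: the four toy documents and their labels are invented, and none of the reviewed articles is claimed to have used this exact formulation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class frequencies, per-class term counts and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(n / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data invented for illustration.
toy_docs = [
    ["join", "the", "fight"],
    ["fight", "the", "enemy"],
    ["nice", "weather", "today"],
    ["the", "weather", "report"],
]
toy_labels = ["extremist", "extremist", "neutral", "neutral"]
toy_model = train_nb(toy_docs, toy_labels)
```

With this toy model, `predict_nb(toy_model, ["fight", "enemy"])` selects the "extremist" label and `predict_nb(toy_model, ["weather", "today"])` selects "neutral". A model of this kind only sees term counts, which is why its behaviour depends so heavily on the features discussed in Sect. 5.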

Fig. 5 Frequency of articles using classification techniques versus those not using them

Regarding the model used by each article, Table 9 summarizes what kind of ML algorithms were used by all the articles including classification tasks. It also highlights the NLP features that are directly (or indirectly) involved in the generation of the classification models.

Apart from these classification tasks, five articles conducted other predictive learning tasks. These include the prediction of how the radicalization process takes place (Fernandez et al. 2018), how extremist behavioral changes occur among the members of a group (Smith et al. 2020), the daily level of online recruitment activities conducted by extremist groups (Scanlon and Gerber 2015), the risk of a video being raided by extremist groups (Mariconti et al. 2019) or the risk of pro-ISIS terms becoming part of a person’s vocabulary (Rowe and Saif 2016).

Fig. 6 Type of ML model used in the literature related to extremism research

We first turn our attention to the use of basic features based on vector space models, such as n-grams and dictionaries (shown in Table 9). The former (n-grams) (Bisgin et al. 2019; Hartung et al. 2017; Kursuncu et al. 2019; Owoeye and Weir 2018; Rekik et al. 2019; Scanlon and Gerber 2015; Sharif et al. 2019; Zahra et al. 2018) have been used more than the latter (dictionaries) (Ahmad et al. 2019; Araque and Iglesias 2020; Fernandez et al. 2018; Kursuncu et al. 2019).

Nevertheless, it is difficult to determine which of these two techniques performs best. In fact, Figea et al. (2016) found no relevant difference between using techniques that depend on the text (such as n-grams) and techniques that are independent from it (such as LIWC) when creating a classification model. A general limitation of both techniques is that similar terms can be used with different meanings in two texts, leading to confusion during the data interpretation process (Saif et al. 2016; Fernandez and Alani 2018; Wei and Singh 2018; Gomes et al. 2017). This is common in the context of religious radicalization, where religious terms can be used by regular religious texts, but also by extremist texts (Gomes et al. 2017). Although the use of n-grams is a possible way to overcome this limitation, they are a primitive option for retaining semantic information (Hall et al. 2020; Sharif et al. 2019). Moreover, other techniques are more informative when conducting complex NLP analysis. For example, n-grams were reported to perform worse than LDA or dictionaries when identifying topics in radical texts (Hall et al. 2020).

Table 9 Type of features input to the ML models employed in the reviewed articles

Regarding sentiment features, they are not usually used as a single feature to detect extremist content, especially concerning political radicalization (Scrivens et al. 2015). While these features do not perform badly and, in fact, perform better than other less complex features (Ahmad et al. 2019), classification models trained with more features usually perform better than those that use only sentiment features (Weir et al. 2016; Hartung et al. 2017; Saif et al. 2017; Owoeye and Weir 2018, 2019; Araque and Iglesias 2020). In fact, classifiers based exclusively on semantic features performed better than those based strictly on sentiment features (Saif et al. 2017; Araque and Iglesias 2020). For example, a study conducted by Weir et al. (2016) compared the usefulness of two classification tools, one based on sentiment features and the other using POS features together with text formatting features such as number of sentences, average length or quantity of characters. The second showed a better performance, although this could be due to the high number of features used in it. Three other articles (Sikos et al. 2014; Yang et al. 2011; Stankov et al. 2010) also utilized text formatting features and other text features as characteristics to describe and classify extremist content. None of them found a significant difference with respect to classifiers that only use features extracted from the text itself. By contrast, several works claim that text formatting features, such as sentence length (Yang et al. 2011) or emoticons (Agarwal and Sureka 2015; Wei et al. 2016), constitute a good add-on for improving the accuracy of classification models.

Finally, the best classification outcomes are achieved by using features based on neural models (word embedding). The main purpose of most articles that use features based on neural language models in classification tasks is the detection of extremist content. Like other types of features, they are quite dependent on the type of ML algorithm at hand (Masood 2021; Kim et al. 2017; Johnston and Marku 2020; Devyatkin et al. 2017). Nevertheless, they perform especially well when combined with neural networks of different types (Ahmad et al. 2019; Johnston and Weiss 2017). Contributions embracing this type of textual representation as classification features reach a similar conclusion: features based on neural models tend to outperform other classical features such as vector space models (Devyatkin et al. 2017; Kursuncu et al. 2019; Masood 2021) or syntactic and semantic features (Kim et al. 2017; Araque and Iglesias 2020). One article, however, pointed out that word embeddings perform poorly compared to n-grams when dealing with short pieces of text (Abd-Elaal et al. 2020). As with other NLP features, combining word embedding based features with other types of features also gave rise to better classification outcomes than using them in isolation (Araque and Iglesias 2020; Nouh et al. 2019).

6.2 Descriptive approaches

A second application of NLP techniques in extremism research stemming from our literature analysis is the characterization and study of the phenomenon of extremism from a descriptive point of view. Within these works, four different descriptive foci can be established:

  • Terms: descriptive analysis on the terms commonly used by extremists. Stated differently, characterization of the type of extremist vocabulary.

  • Topics: detection of the most common topics discussed by extremist texts.

  • Sentiment: analysis of the sentiment and tone of an extremist discourse.

  • Semantic: analysis of the contextual information around terms inside an extremist text.

Table 10 summarizes the type of descriptive analysis performed in each of the reviewed articles. The simplest descriptive approach focuses on the terms, whereas the inclusion of other approaches (topics, sentiment, semantic or punctuation) adds extra layers to the description of the discourse. This is why the terms approach is central in the literature related to extremism description. In addition, we notice that almost all the other descriptive analyses perform a prior term analysis, which illustrates the complementarity of these approaches. Sentiment analysis is the only one that is occasionally performed independently.

Table 10 Descriptive linguistic approach used by the reviewed articles

Regarding the insights about extremism found in the reviewed works, Sects. 6.2.1 and 6.2.2 highlight the main observed patterns, classified by the two predominant types of extremism found in Sect. 4: religious (mostly focused on jihadism) and political (mostly focused on far-right movements).

6.2.1 Literature insights about religious extremism

The insights obtained from the comparison of the literature regarding religious extremism, focused on jihadism, can be divided into different sections:

  • Terms: When centering the scope of the analysis on common terms used by religious extremism, the name “ISIS” was mentioned more by neutral users than by extremist users (Wignell et al. 2018; Gomes et al. 2017; Bisgin et al. 2019), who preferred the term “Islamic State” or “Caliphate”. The most frequent terms encountered in the extremist texts analyzed in the articles were related to religious (e.g. Allah, Jihad or Islam) or geographical references (e.g. Syria, Raqqa, America or Iraq) (Wignell et al. 2018; Gomes et al. 2017; Wei and Singh 2018; Bisgin et al. 2019; Skillicorn 2015). The descriptive analysis of the text also unveiled the common use of specific slang terms, such as “Crusaders”, “Mujahideen” or “Abu” (Gomes et al. 2017; Wei and Singh 2018).

  • Topics: Works carrying out a descriptive analysis based on the topics show that the most frequent topic related to Jihadi extremism was, unsurprisingly, religion (Scanlon and Gerber 2015; Bermingham et al. 2009; Kinney et al. 2018). The most easily identifiable topics in Jihadi magazines were war, geopolitics, religious speech, government and administration (Bisgin et al. 2019). Inspire (Al Qaeda’s magazine) was rather focused on conflict legitimisation and philosophy, while Dabiq and Rumiyah (ISIS magazines) were more focused on the geopolitical conflict (Kinney et al. 2018). Some of the topics, such as recruitment, are reportedly hidden among topics referring to religious and military aspects of the Syrian conflict (Scanlon and Gerber 2015).

  • Sentiment: Combining sentiment analysis and topic detection, jihadi women were found to be more extreme than men in their messages related to nearly every relevant topic (Bermingham et al. 2009). Concerning the magazines, most of their texts have a negative tone and recurrently embrace terms related to fear, except when they discuss topics such as paradise or martyrdom (Wignell et al. 2018; Macnair and Frank 2018). Words such as Allah or Islamic State were also found to have negative connotations when analyzed through a sentiment analysis approach. The authors hypothesized that this might be due to their use as a justification of violent behaviors. A study concerning jihadi radical forums also uncovered that the most extremist texts scored higher on negative dimensions, using violence and hate terms, than more moderate alternatives (Chen 2008). Finally, a study postulated that radical users who showed a positive tone towards ISIS in their tweets were in fact showing complicity with it (Wei et al. 2016).

  • Semantic: While the descriptive term analysis provides a first insight, it should be remembered that context can alter the meaning of a token (Wei and Singh 2018). From this perspective, articles focused on semantic discrimination allow checking how these keywords are used depending on the intention of the text. For example, Gomes et al. (2017) stated that the background of the terms “ISIS”, “Islamic” and “Syria” changes according to the origin of the text under analysis (neutral or extremist). A study delving into the divergences of the semantic meaning of words, conducted by Fernandez and Alani (2018), classified terms into different semantic groups (category, entity and type of entity). Similar words were found to be used differently by radical and non-radical users, including the names of radical groups. Entities were concluded to be a good way of discriminating the semantic meaning of a term. Finally, the study of Kursuncu et al. (2019) conducted a comparative analysis between extremist and non-extremist religious users. Their findings showed that while both groups shared terminology when referring to religious concepts, the extremist group made use of many more terms related to radical Islamism and hate speech. This is in line with the evidence that token analysis techniques combined with other strategies can be more informative than using them alone.

As these insights show, and taking into consideration the features of an extremist discourse presented in Sect. 2.3, Jihadi extremism possesses several of these features. The use of specific slang and expressions, together with a negative tone, shows a specific linguistic style. These discourses also place special emphasis on a theological and moral narrative, as well as on the glorification of religious acts of violence against a common enemy (Western society and non-believers). It is difficult to determine how much of their use of war topics relates to a specific narrative or to the geopolitical situation of the territories in which they operate. Nevertheless, it is fair to state that war (and its instrumentalization) is a key element in the construction of their narrative.

6.2.2 Literature insights about political extremism

Focusing now on the reviewed works that conduct a descriptive analysis of the terms most commonly used by far-right extremism, and following the same structure as with religious extremism, we find different insights:

  • Terms: An article analyzing an Alt-right community (Torregrosa et al. 2020) reported that they used racist (BlackMagic, WhitesLivesMatter), anti-immigration (BuildTheWall, IllegalAliens), supremacist (WhiteGenocide, WhitePeople, ChasingDownWhites) and anti-left (AntifaTerrorists) terms and hashtags in their tweets. This work also exposed the use of specific slang to refer to other racial minorities, such as “aliens” to refer to immigrants. Among a sample of videos massively attacked by far-right groups from 4chan (see Footnote 1), some of the most mentioned keywords were “black”, “police”, “white”, “shot”, “gun”, “world”, “war”, “American”, “government” or “law” (Mariconti et al. 2019). Other relevant keywords of far-right extremist groups include the numbers “14” (a reference to the “fourteen words”, one of the most popular white nationalist slogans, coined by David Lane, a member of the white supremacist terrorist group known as The Order: “We must secure the existence of our people and a future for white children” (Michael 2009)) and “88” (meaning “Heil Hitler”, as the H is the 8th letter of the alphabet), as well as references to genocide, Nazism, and anti-Islamic and anti-Jewish groups (O’Callaghan et al. 2012, 2015).

  • Topics: the most common topics discussed by far-right groups were racial topics (Ottoni et al. 2018; Ben-David and Fernández 2016; Alizadeh et al. 2019; O’Callaghan et al. 2015), immigration (Ottoni et al. 2018; Ben-David and Fernández 2016) and war (Ottoni et al. 2018), with these groups being very aggressive when addressing them (Mariconti et al. 2019). This conforms to expectation, as racial content, war and immigration are topics commonly found in the far-right discourse (Panizo-LLedot et al. 2019). Interestingly, non-institutional groups were more focused on a racial and anti-immigration discourse (Ben-David and Fernández 2016; Klein and Muis 2019) than the institutional far-right groups, such as political parties. Those parties were occasionally found to have a populist discourse directed against the elites (Klein and Muis 2019). The only article analysing far-left groups pinpointed that they discussed feeling-related topics more than other groups (Alizadeh et al. 2019).

  • Sentiment: one of the reviewed articles (Torregrosa et al. 2020) underscored that a higher relevance in a far-right community was related to a significantly higher use of negative and aggressive terminology. Similarly, the study of Figea et al. (2016) showed that words of anger can also be useful to identify emotional concepts related to political extremist content, such as aggressiveness and concerns about other groups. Also, highly negative messages were commonly directed at Jews, LGBT and black people (especially the first two) (Scrivens et al. 2020). Only one article (Alizadeh et al. 2019) focused on analyzing differences between far-right and far-left discourses, using a dictionary-based approach (both LIWC and Moral Foundation dictionaries). For this purpose, the authors combined different NLP features to conduct a descriptive analysis from different perspectives: terms, topics and feelings. As a result, the far-right was reported to use more positive words, together with terms regarding obedience to authority and purity. By contrast, the far-left resorted to more negative terms, anxiety words and terms related to justice and harm avoidance. As for the sentiment approach, this study also revealed that both groups used a generally negative tone when compared to non-extremist political groups. However, of all the previously discussed outcomes, only words related to obedience to authority yielded a significant difference.

  • Semantic: Finally, the only reference to semantic analysis in political extremism related articles appears in Ottoni et al. (2018), who found that terms from extremist groups tend to be classified in “negative” categories by the semantic tagger from Empath. Within these categories, the most relevant terms were “anger” and “violence”.
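The dictionary-based comparison mentioned under Sentiment can be illustrated with a minimal sketch. It assumes a tiny, made-up dictionary (`CATEGORIES`) whose entries use a trailing `*` to mark word stems, a convention found in LIWC-style dictionaries; the categories, entries and example sentences are hypothetical and are not taken from LIWC, the Moral Foundation dictionaries or any reviewed article.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> entries. A trailing "*" marks a
# stem that matches any word starting with it. Entries are illustrative only.
CATEGORIES = {
    "anger": ["hate*", "angr*", "attack*"],
    "anxiety": ["afraid", "worr*", "fear*"],
    "authority": ["obey*", "order", "law*"],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def matches(token, entry):
    """Return True if a token matches a dictionary entry (exact or stem)."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry

def category_rates(texts):
    """Share of tokens falling in each category across a group's texts."""
    counts = Counter()
    total = 0
    for text in texts:
        for token in tokenize(text):
            total += 1
            for category, entries in CATEGORIES.items():
                if any(matches(token, e) for e in entries):
                    counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}

# Two hypothetical groups of messages to compare.
group_a = ["They attacked us and we are angry", "Obey the law"]
group_b = ["I am worried and afraid of what comes next"]
rates_a = category_rates(group_a)
rates_b = category_rates(group_b)
```

Comparing `rates_a` and `rates_b` category by category gives the kind of descriptive contrast reported above; whether a difference is significant would still require a statistical test, which this sketch does not include.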

As with religious extremism, far-right extremism also presented several features of the extremist discourses examined in Sect. 2.3. One of its most relevant traits is the use of specific and aggressive slang to refer to other groups. However, this is not particularly surprising, considering that some of these groups are very active on the Internet. They rely on political and historical narratives to build their discourse, also including a component of “self-victimization” therein. They also draw on hate speech and otherness as discursive resources (especially the former, compared to religious extremism), and frequently include references to war narratives.

7 NLP dataset and tools

In the analysis carried out in Sect. 4, it was noted that the data sources and the specific NLP tools in use appear frequently as relevant keywords of the works contained in the related literature corpus. This is because these elements are a fundamental part of any research work related to the study of a particular domain, in this case the extremism phenomenon. The following subsections present a detailed description of both the data sources and tools used in the reviewed works, so that the reader gets a complete view of the available resources.

7.1 Datasets and data sources

Collecting a dataset is a key part of any NLP research process. In the case of online extremism, this step becomes especially complex, as most information represents a risk for security and/or anonymity. Therefore, it is often hard to find public datasets online capable of providing a solid substrate of information for modeling and/or characterization of extremism.

Table 11 Publicly available datasets for extremism research

Many of the articles included in the review use their own datasets. The reader is encouraged to contact the authors of the different articles to ask for their data. However, in this section we deal with articles that use datasets that are either public, or can be obtained from their original source on demand. Table 11 shows a summary of the publicly available datasets used by the literature. This table contains the name of the dataset, an approximation of its size (in the form of number of samples and users), the original language, the source of the data, articles using those datasets and a bibliographic reference including a link to the dataset itself.

There also exist data sources which are often used to extract textual information, but that require preprocessing to transform them into valuable datasets for further modeling. Table 12 presents the different extremist magazines used by the literature to conduct NLP analysis. Data retrieved from these sources, however, must be curated before conducting any further analysis.

Table 12 Publicly available extremist data sources

Besides the already mentioned datasets (which are part of this review), other sources might be useful for the researcher interested in obtaining more textual data related to the topics of extremism and radicalization. While these datasets are not used by the reviewed documents, and therefore fall outside this article’s scope, it is important to highlight their existence in order to assist researchers in their search for more publicly available data. Following the types of extremism covered by the articles in this review, they are divided into two groups: political and religious extremism.

Concerning political extremism, a dataset of the far-right forum named Stormfront (de Gibert et al. 2018) can be found in a GitHub repository.Footnote 2 Likewise, a dataset of alt-right users was validated by Thorburn et al. (2018), which is available upon request to the authors of the study. Moreover, speeches from different political parties can be accessed on the webpage of the Manifesto Project Database,Footnote 3 with textual data corresponding to political parties with different ideologies.

Finally, regarding religious extremism, the Global Terrorism Research Project (the source from which the Inspire magazine cited in Table 11 can be downloaded) features much more content than that magazine, including a higher number of periodicals and datasets.Footnote 4 The same holds for the AZSecure webpage, which gathers datasets from dark web jihadist forums in different languages.Footnote 5

7.2 Tools

When reporting a research work, authors often state the tools they used in their experiments, along with the databases in use, for example, to create a lexicon. This section discusses the most frequently used NLP tools when studying extremism and radicalization.

Fig. 7 NLP tools used by the articles reviewed

Figure 7 illustrates the frequency of use of different NLP tools. Only those used in three or more articles have their own category, while the rest are included under the “others” category. Also, the category “not specified” includes all those articles that do not clarify the software tools they use (Chen 2008; Alghamdi and Selamat 2012; Rowe and Saif 2016; Wei and Singh 2018; Scanlon and Gerber 2015; Hartung et al. 2017; Zahra et al. 2018; Sharif et al. 2019; Fernandez et al. 2018). As observed in this figure, the most frequently used NLP tools are:

  • SentiStrengthFootnote 6: developed in 2010 (Thelwall et al. 2010), this tool was created to analyze the emotional valence (sentiment) of short texts. It uses a dictionary with sentiment related terms, from which it calculates the “strength” of the tone of different expressions. SentiStrength can report binary (positive vs negative), trinary (positive/negative/neutral) and single scale (\(-4\) to \(+4\)) sentiment results. From the reviewed articles, it was the most commonly used tool to determine sentiment (Weir et al. 2016; Scrivens and Frank 2016; Wei et al. 2016; Saif et al. 2017; Owoeye and Weir 2019; Scrivens et al. 2015; Macnair and Frank 2018; Scrivens et al. 2020, 2018).

  • Linguistic Inquiry and Word CountFootnote 7: this tool, also known as LIWC (Pennebaker et al. 2001), was created in 2007 with the purpose of studying language from a psychological perspective. LIWC relies on the usage of pre-established dictionaries (which can be expanded with third-party dictionaries supplied by the researcher) that are used to identify categories of words and psycho-linguistic processes underlying a text (Tausczik and Pennebaker 2010). Eight articles used it to conduct their analysis on extremism (Alizadeh et al. 2019; Hall et al. 2020; Smith et al. 2020; Sikos et al. 2014; Figea et al. 2016; Nouh et al. 2019; Torregrosa et al. 2020; Rehman et al. 2021).

  • OpenNLPFootnote 8: the OpenNLP library is an ML-based toolkit for the processing of natural language text,Footnote 9 written in Java. It supports different NLP tasks, providing several options to analyze texts. Four reviewed articles adopted OpenNLP in their experiments (Scrivens et al. 2018, 2015; Scrivens and Frank 2016; Weir et al. 2016).

  • IBM Watson natural language understandingFootnote 10: this software, developed by IBM, includes several packages at its core, which allow conducting NLP analyses from different perspectives (for example, open analysis versus questions and answers). This software can apply several NLP techniques to texts, such as semantic tagging, sentiment scoring or keyword and topic extraction. It was used by two articles included in the review (Ahmad et al. 2019; Wignell et al. 2018). Furthermore, the AlchemyAPI software, which was used by another two articles (Saif et al. 2017, 2016), was eventually included in the core of Watson NLU in 2015.Footnote 11

  • Natural language toolkitFootnote 12 (NLTK): this is an NLP Python library created in 2002 (Loper and Bird 2002). It performs NLP tasks very similar to those of OpenNLP. Four articles used this library (Ben-David and Fernández 2016; Heidarysafa et al. 2020; Kinney et al. 2018; Klein and Muis 2019).

  • Stanford Core NLPFootnote 13: the Stanford CoreNLP is another Java based NLP tool, developed at Stanford University (Manning et al. 2014). It can perform NLP analysis in different languages, and one of its distinctive features is that it is quite easy to set up and run (Pinto et al. 2016). Three articles resorted to this NLP tool (Wei et al. 2016; Kim et al. 2017; Bisgin et al. 2019).

Even though Fig. 7 characterizes the frequency of use of the above NLP tools, other alternatives are used less frequently (namely, by fewer than three reviewed articles). These tools include WordNet (Bermingham et al. 2009), Stanford Maximum Entropy Part-of-speech Tagger (Bermingham et al. 2009), Vader (Wei et al. 2016; Torregrosa et al. 2020), WMatrix (Prentice et al. 2012), Gensim (Ottoni et al. 2018), iSA (Ceron et al. 2019), the Arules Package (Rekik et al. 2019), MALLET (Hall et al. 2020), the Language Detection Library for Java (Agarwal and Sureka 2015), POSIT (Weir et al. 2016; Owoeye and Weir 2018), TextRazor (Fernandez and Alani 2018), Language Model Toolkit (Mariconti et al. 2019), ConceptNet (Mariconti et al. 2019), TensorFlow Vocabulary Processor (Johnston and Marku 2020) and the Python-based tone analyzer API (Ahmad et al. 2019).
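To make the binary, trinary and single-scale outputs described for dictionary-based sentiment tools more concrete, the following toy sketch scores a text with a small lexicon. It is not the SentiStrength algorithm: the lexicon (`LEXICON`), its strength values, the rule of keeping only the strongest term of each polarity and the tie-breaking choice in `binary` are all assumptions made for illustration.

```python
import re

# Hypothetical term strengths (positive 2..5, negative -2..-5).
# Real tools ship curated lexicons; these values are made up.
LEXICON = {"love": 3, "great": 3, "good": 2, "hate": -4, "awful": -4, "bad": -2}

def strengths(text):
    """Return (positive, negative): the strongest term of each polarity.

    The neutral baselines are +1 and -1, so a text without any lexicon
    hit scores (1, -1).
    """
    pos, neg = 1, -1
    for token in re.findall(r"[a-z]+", text.lower()):
        value = LEXICON.get(token, 0)
        pos = max(pos, value)
        neg = min(neg, value)
    return pos, neg

def single_scale(text):
    """Sum both strengths into one score between -4 and +4."""
    pos, neg = strengths(text)
    return pos + neg

def trinary(text):
    """Map the single-scale score to positive / negative / neutral."""
    score = single_scale(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def binary(text):
    """Force a two-way decision; ties are resolved as positive (an assumption)."""
    return "positive" if single_scale(text) >= 0 else "negative"
```

For instance, `strengths("I love it but the ending was awful")` keeps one positive and one negative strength for the same text, which is what allows the three output modes to be derived from a single pass over the tokens.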

8 Discussion and conclusion

This review has aimed to thoroughly explain the contributions so far of NLP to extremism research. To this end, we recall that the literature survey has been geared towards providing an informed response to several research questions posed in the introduction, regarding the different NLP issues under analysis. Throughout the article those issues have been analyzed, both descriptively and comparatively, based on a literature corpus selected under specific criteria. This last section rounds out the overview in three parts: the answers to the research questions mentioned previously (Sect. 8.1), a summary of future trends, challenges and directions (Sect. 8.2), and a brief conclusion with an outlook (Sect. 8.3).

8.1 Answers to research questions

The different research questions regarding the state of the art in NLP for extremism research were formulated in the introduction as a means to drive the methodology and analysis of the surveyed literature. Once this analysis has been completed, these research questions can now be answered by virtue of the insights drawn from the review process conducted in previous sections. Figure 8 depicts a schematic summary of the conclusions reached after the exhaustive review, highlighting the main findings for each of the research questions. Each of these answers is explained in detail below.

Fig. 8 Diagram showing the main items of the replies to the posed research questions

  • RQ1. What are the current topics and contributions from NLP to extremism research? In light of the reviewed literature, an increasing number of NLP approaches have undoubtedly been applied to extremism research over the last few years. Religious extremism is the most covered topic, followed by far-right extremism. Terrorism (especially Jihadist terrorism) and counter-terrorism appear to be the key motivational factors behind the interest in these topics, as detecting extremist content can help prevent radicalization processes and, thereby, avoid attacks such as those experienced in recent years (Johansson et al. 2017). The interest in extremism detection is clearly reflected in the many mentions of ML algorithms, as their combination with NLP approaches can be useful to create classification models that allow for the accurate identification of extremist content. Finally, even though it is beyond the scope of this review, SNA also appears as an analytical approach commonly linked to the study of language in extremism research.

  • RQ2. What NLP techniques are used in extremism research? Section 5 has shown that n-grams, TF, TF-IDF and sentiment analysis are the most commonly used techniques to study the extremist discourse. It is unsurprising that the first two approaches emerge as the most frequent in the literature, taking into account that they are a preliminary step for more complex analyses, for example sentiment analysis itself. However, it should be considered that the use of neural network models (word embeddings) is developing fast inside the community working on the study of extremist discourses, hence it should be embraced as a good starting point for researchers and newcomers interested in this topic. This is especially relevant, as authors have pointed out that detecting the most common terms used in the specific domain is not enough to understand the sense in which they are used in the text under consideration. Therefore, techniques capturing information about the context and the meaning of the terms (e.g. embedding or semantic tagging) must be considered an important constituent part of any textual analysis in prospective contributions. This statement is further supported by the fact that extremist texts rely on words from regular discourses, but with different objectives.

  • RQ3. How have NLP techniques been applied in the field of extremism research? As stated in Sect. 6.1, 54.68% of the analyzed articles performed classification tasks using ML approaches. Again, this was expected, as the main objective of extremism research is to detect extremist content in advance. Among the ML algorithms in use, SVM stood out as the most commonly used one, followed by random forests, Naïve Bayes and decision trees. In terms of classification accuracy, most experiments with SVM generally yielded good performance levels. However, in the most recent research works, approaches extensively relying on different flavors of Neural Networks performed particularly well when compared to other models, and should be underlined as a promising trend for the detection of extremism based on ML classifiers. The rest of the articles (see Sect. 6.2) focused on describing the main features that differentiate between regular and extremist texts, towards uniquely defining this type of discourse. This prompted insights that could help future researchers identify which textual features are more useful to analyze in order to detect (and eventually prevent) extremism in Social Media.

  • RQ4. What NLP software tools are commonly used in extremism research? Section 7 pointed out SentiStrength as the most frequent tool to conduct NLP analysis. Specifically, this tool is used to perform sentiment scoring through the automatic tagging of words around a given token. The second one in terms of usage frequency is LIWC, a tool based on dictionaries with a psycholinguistic approach. Two further remarks should be made regarding this response. First, 25 articles did not report any details about the software tool utilized in their experiments. Secondly, 17 articles employed a software tool used by fewer than three articles. Therefore, while several NLP software tools form a modest trend with regard to RQ4, it can be concluded that no NLP tool clearly dominates the literature.

  • RQ5. Which publicly available datasets or data sources have authors used to conduct NLP experiments in extremism research? Most of the articles included in the review drew on their own private datasets when conducting their experimentation. However, some of the datasets (especially those concerning religious radicalization and Twitter, forums or radical magazines) are currently publicly available to the general community. A summary of these public datasets, together with supplementary datasets suggested by the authors, has been presented in Sect. 7.1.
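As a reminder of what the techniques named under RQ2 compute, the sketch below extracts n-grams and derives TF and TF-IDF weights using only the Python standard library. It uses the plain textbook definition idf(t) = log(N / df(t)); the reviewed articles may rely on other variants (for example, smoothed IDF), and the three example documents are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def tf_idf(documents):
    """TF-IDF per document, with idf(t) = log(N / df(t)).

    This is the unsmoothed textbook variant; libraries often add smoothing.
    """
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = term_frequency(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Invented, already tokenized example documents.
docs = [
    "the march starts at noon".split(),
    "the vote starts today".split(),
    "the march was peaceful".split(),
]
bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

A term that occurs in every document (here `the`) receives a weight of zero, while terms concentrated in a single document receive the highest weights, which is why TF-IDF is commonly used as a first step for surfacing domain-specific vocabulary.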

8.2 Future trends and challenges

The research questions and their answers given above draw a general yet detailed picture of the current state of the art of NLP for extremism research. However, the literature analysis required to inform such answers has also given rise to a range of insights and prospects on the future of this research area. This section outlines the future trends that the literature is likely to follow from its current state, as well as the challenges that the community will encounter and possible directions to tackle those challenges effectively.

Fig. 9 Future trends and challenges of NLP approaches applied to the extremism research area

Figure 9 shows a schematic summary of such trends and challenges. As shown in this figure, there are three main trends rooted in the research questions, and three future challenges for the NLP applications to extremism research, which are explained in detail next:

  • Future trends:

    1. The relevance and global interest in political extremism will grow fast in the short term. At the time of writing this survey, the Capitol assault and the shutdown of Parler (a social platform famous for being used by pro-Trump movements) have placed political extremism under the attention of both the general public and the research community. In fact, several datasets related to online political extremism are released on a continuous basis, paving the way towards studying this phenomenon in depth. These studies may leverage the lessons learned from the study of religious extremism. Therefore, while there are other extremist movements that will draw attention from researchers (e.g. groups like the Incels, or Involuntary Celibates, as stated in Voroshilova and Pesterev 2021), we foresee a vibrant research activity around the detection and characterization of political extremism in future years (see, for example, Scrivens et al. 2021).

    2. When it comes to ML for extremist prediction, neural network based techniques have showcased promising performance levels in some of the reviewed works. However, the literature approaching extremist classification with this modeling choice is relatively scarce (see also Gaikwad et al. 2021). Together with the continuous evolution of new neural architectures, the proliferation of new contributions resorting to modern neural networks is arguably a very promising trend for the future of extremism research (in similar areas, such as hate speech, a search on Google Scholar regarding deep learning returns four times as many results as for extremism). Particularly, the use of NLP approaches hinging on Deep Learning architectures (based on neural language models) also offers an effective way to overcome the lack of semantic information extracted from the texts, which is a key challenge in the study of extremist discourses (in fact, some of the upcoming research trends involve the increased use of technology based on embeddings, such as Alatawi et al. 2021; Araque and Iglesias 2021). However, the black-box nature of these neural network models will open opportunities for explainability techniques (Arrieta et al. 2020). We envision that the explanation of these models will represent a turning point for the use of this type of approach, providing extended information about what these sophisticated yet opaque models observe in texts to elicit their predictions, and eventually leading to extended insights on the extremist discourse.

    3. Multivariate classification models (those fed with different types of features for discriminating between extremist and non-extremist texts) generally achieve better results in the reviewed papers. Furthermore, the general analysis carried out in Sect. 4 shows that some works adopt elements from social network analysis (SNA) to pursue research studies in the area of extremism. Such elements, which essentially build upon the analysis of interactions among users, could be a good complement for the study of extremist dynamics in online environments (Camacho et al. 2020). Indeed, approaches combining NLP and SNA have been investigated in other research fields, such as fake news (Zhou and Zafarani 2020), and more recently in the extremism area (Torregrosa et al. 2020), yielding good results. Also, by combining these techniques with an extra layer in the analysis, such as time (e.g. Theodosiadou et al. 2021), the information obtained might be enhanced and more useful for the researcher. Therefore, it is our belief that the adoption of approaches combining techniques from different areas will be a relevant step in the analysis of extremist behaviors, not only based on textual discourses, but also on the interaction dynamics held in online social media.

  • Future challenges:

    4. The presence of multiple languages in a given extremist text is a known limitation of the research area, which occurs with particular recurrence in religious extremism. This shortcoming, which is very common in this type of text, cannot be solved just through the use of NLP techniques that may be too “simple” to overcome this problem (such as the use of n-grams instead of dictionaries). In this line, the new advances in word representation learning in NLP, which are enhancing the representation of the semantic information of words in word embeddings (Pilehvar and Camacho-Collados 2020), will contribute to a fine-grained processing of extremist documents. Likewise, recent years have seen progress in the development of cross-lingual word embeddings (Søgaard et al. 2019) that represent the semantic knowledge of words from more than one language. Cross-lingual word embeddings have shown their capacity to create a common vector space model for several languages in different tasks, such as word translation and sentiment analysis (Camacho-Collados et al. 2020). Hence, the use of the latest advances in cross-lingual word embeddings will help address the challenge of multilinguality in extremist texts.

    5. Delivering interpretable explanations for the decisions issued by classification models is one of the most important challenges currently prevailing in the area, due to the psychological, criminological and sociological roots of extremism. The interest in detecting extremist content is not only justified by the detection itself, but also by the extraction of insights to gain a deeper understanding of the mind and behaviors of extremists. If this understanding is supported by explanations of the reasons why an extremist text is detected as such by a model, and if such explanations are made understandable for an audience that does not necessarily have any background in NLP, discourses can be characterized, and first signs of extremism can be identified. This can be realized by resorting to algorithmically transparent classification models (e.g. decision trees), at the cost of a potential loss of accuracy when compared to more powerful yet non-interpretable modeling counterparts, as well as by leveraging the latest advances in explainability and interpretability for NLP reported in recent workshops specialized in the matter (Kumar et al. 2021). Achieving a good balance between these modeling choices as per the needs of the extremism research area is a challenge that must involve multi-disciplinary views to reach a consensus on what interpretations are needed from the models to properly understand extremism in all its forms.

    6. Finally, the relative scarcity of public data sources will remain one of the most challenging aspects to deal with in extremism research. Although massive data can be extracted from online platforms such as Twitter or web forums, the ethical concerns related to anonymity and the private nature of most data stored in such platforms prevent researchers from sharing their datasets. This ultimately entails the creation of new datasets every time a new experiment is conducted, instead of enriching already stored datasets with new information. Therefore, creating and sharing datasets with other researchers, always respecting the ethical clauses imposed in this regard, will ease the arrival of new researchers and teams to this field, improving the quality and quantity of the research results. We note, though, that the community is still far from this ideal stage.
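The trade-off between transparency and accuracy raised in the interpretability challenge can be illustrated with a small, inspectable classifier. The sketch below implements multinomial Naïve Bayes with Laplace smoothing in plain Python and exposes a per-token log-likelihood ratio that can be read as evidence for one class or the other. It is a generic illustration under stated assumptions, not the method of any reviewed article: the labels (`flagged`, `regular`), the helper `token_evidence` and the four training samples are hypothetical.

```python
import math
from collections import Counter

def train(samples, alpha=1.0):
    """Fit multinomial Naive Bayes on (tokens, label) pairs.

    Returns log priors and Laplace-smoothed log likelihoods per label.
    """
    labels = sorted({label for _, label in samples})
    vocab = sorted({t for tokens, _ in samples for t in tokens})
    doc_counts = Counter(label for _, label in samples)
    token_counts = {label: Counter() for label in labels}
    for tokens, label in samples:
        token_counts[label].update(tokens)
    log_prior = {l: math.log(doc_counts[l] / len(samples)) for l in labels}
    log_like = {}
    for l in labels:
        total = sum(token_counts[l].values()) + alpha * len(vocab)
        log_like[l] = {
            t: math.log((token_counts[l][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_like

def predict(model, tokens):
    """Return the label with the highest score; unseen tokens are skipped."""
    log_prior, log_like = model
    scores = {
        l: log_prior[l] + sum(log_like[l][t] for t in tokens if t in log_like[l])
        for l in log_prior
    }
    return max(scores, key=scores.get)

def token_evidence(model, token, label_a, label_b):
    """Log-likelihood ratio: > 0 favours label_a, < 0 favours label_b."""
    _, log_like = model
    return log_like[label_a][token] - log_like[label_b][token]

# Hypothetical, already tokenized training samples.
samples = [
    (["enemy", "must", "fall"], "flagged"),
    (["enemy", "at", "gates"], "flagged"),
    (["nice", "walk", "today"], "regular"),
    (["walk", "to", "work"], "regular"),
]
model = train(samples)
```

Calling `token_evidence(model, "enemy", "flagged", "regular")` returns a positive value, showing which tokens push a prediction towards a class; an audience without an NLP background can inspect such a ranking directly, which is the kind of explanation that more powerful but opaque models do not provide out of the box.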

8.3 Conclusions

Currently, extremism represents a security and ideological challenge for Europe. Different kinds of movements, such as jihadi terrorism and far-right groups, have changed the political and social agenda of several countries, introducing hot topics that are now discussed as relevant issues for those countries (Ali 2021). To confront this phenomenon, it is first necessary to understand the discourse, which is a reflection of the ideology of extremist groups. Only through this understanding can these movements be prevented and counteracted.

NLP offers effective technical resources to describe these discourses, together with ways of extracting insights regarding how extremists use language compared to non-extremist groups. This review aims to achieve this objective by providing the reader with the description of:

  1. The extremism itself and the concept of extremist discourse.

  2. The NLP techniques used to analyze texts.

  3. The different applications of these techniques.

  4. Software tools and extremism datasets.

These manifold aspects of NLP for extremism research have been critically examined in order to identify future research directions, relevant trends and challenges to overcome in the study of extremist discourses (such as the need for explainable models and cross-lingual NLP techniques). Considering the insights extracted from the review, we hope that the directions, trends and challenges given in this work encourage future studies aimed at the detection and characterization of the extremist discourse in texts.