Detection of Jihadism in Social Networks Using Big Data Techniques Supported by Graphs and Fuzzy Clustering

This is an open access article distributed under the Creative Commons


Introduction
Social networks play a very important role in the way people think. When accurately targeted, repeated messages can reinforce political ideas or even change the minds of the most undecided. In this regard, Jihadism has been identified as one of the movements that relies the most on social networks to spread propaganda and to try to influence public opinion. In [1], the Madrid bombings of 2004 are used as a case study to analyze grassroots jihadist networks and how terrorist organizations use collective action at the local level to cause enormous impact.
Social networks are also used by terrorist organizations as a tool for recruiting new members. Sentiment analysis to detect radicalization has been applied to social networks in the past as an evolution of previous analyses that traditionally focused on websites and forums [2]. Identifying the nodes that play an important role as influencers or that spread propaganda, and the way in which that propaganda propagates, is a growing area of research [3][4][5].
A very challenging part of the analysis presented in this paper is how to measure the impact of each user in the network, as it depends on the volume of tweets (activity) combined with the number of followers but is also amplified by the number of retweets. For this purpose, a deep analysis is carried out using graphs. Network theory in general, and graph theory in particular, has been used for social network analysis (SNA) as one of the most relevant tools [6].
Labelling users as influencers, followers, or neutrals is also very difficult, and the false positives or false negatives of a standard classifier may have dramatic consequences. Missing data can be corrected by applying genetic algorithms able to predict the absent text, as in the case of missing answers in questionnaires [7]. However, in the current research the problem is more related to ambiguity resulting from unreliable data or contradictory terms in the messages. In this case advanced text mining or natural language processing techniques would be appropriate [8][9][10]. Nevertheless, because the messages to be analyzed are mostly translations from Arabic and its dialects into English, it was decided not to put a large amount of effort into natural language processing, because of the risk of ending up modeling the translation process more than the original meaning of the messages.
For the purpose of assigning profiles to the users, the proposed methodology uses fuzzy clustering techniques that provide a probability of classification for each possible profile. Fuzzy clustering has been successfully applied in semisupervised environments [11], in combination with the classic k-means clustering method [12], and more specifically to detect malicious components [13]. In this paper the fuzzy clustering method takes as input the results obtained from the graph analysis, along with some characteristics directly extracted from the social network.

Description of the Methodology: Architecture Based on Graphs and Fuzzy Clustering

Big Data Architecture.
A big data architecture is proposed with the goal of monitoring Twitter in real time and predicting threats, either by detecting changes in the profiles or by detecting changes in the level of activity. The system can retrain itself to update profiles and classification patterns while maintaining its detection capabilities. A big data approach is very suitable for this kind of real-time analysis, especially on social networks such as Twitter, in which messages are generated continuously and the system must collect, analyze, and archive or discard them [14,15]. The proposed implementation was simulated using Kaggle's databases plus the Twitter extraction API for demonstration purposes and to refine the algorithms based on graphs and fuzzy clustering (as shown in Figure 1).

Fuzzy Architecture to Isolate Suspicious Profiles.
Using the previous works, and [13] in particular, as inspiration, we have designed a system capable of locating those profiles that are hidden at first sight but prone to modify their behaviour based on the influence they receive. In order to do so, we have considered a set of tweets as the source of information to measure the impact of an influential user on others.
Our process has followed five steps. First, the information was acquired from Kaggle's database and from Twitter user accounts by extracting their tweets. Then, that information was filtered to select the most and the least active users (understanding active users as those that either generated information or actively retweeted information from others). Next, we established those parameters with the potential to differentiate users, such as the sentiment of the messages, the users followed, and some others. Next, the network graph was created, using the relationships among users to apply centrality measures in order to obtain new parameters that serve as additional inputs to the fuzzy clustering process. Finally, the system shows those hidden users with an undefined profile, who are candidates to be traced in the future. Figure 1 represents the interaction of these five stages.

Tweet Extraction Techniques.
We have used several sources of information to get as many profiles as possible to perform the study. As primary sources, we have used Twitter and Kaggle and, as secondary sources, several forums with terrorist ideologies and the official website of ISIS, known as Wafa Media Foundation, as seen in Figure 2. Primary sources were used to collect the base information to be studied and to define the profile of the users actively spreading terrorist content, while secondary sources were used to get familiar with the vocabulary used by these users in social propaganda. Many keywords were obtained in this analysis to perform Twitter searches in a second step.
To download tweets from users speaking of terrorism (in favour and against), Twitter's REST API and Streaming API were used together. The REST API was used to check whether the users whose information was being extracted had active accounts. The Streaming API was used to download new users in real time and so expand the knowledge base about terrorists or potential targets. Although previous studies on network evolution show that the properties of social networks tend to reach an equilibrium [16], studies focused directly on terrorism are able to detect new trends and platform changes [17].
This way, we have expanded the suspicious profiles that we had from the initial databases "Isis fanboy" and "About isis" (Kaggle) with profiles that follow and spread the information and users included in these databases. The probability of obtaining repeated users and content is very high, as many times the downloads belonged to followers of the downloaded users retweeting the same content (friends of friends sharing the same news, opinions, and so on). This was understood as a clue that we were searching for information in the right direction.
This set of data was already registered and classified by level of risk, and the associated information could still be downloaded, which indicated that the accounts were active and allowed us to continue our research.
The extraction and analysis were focused on a social circle, in which we defined as "popular users" those with the highest number of followers + publications + impact, serving as the basis to categorize the others that we already had. For this purpose, we retrieved several fields, including the following:
(i) Location: an attribute that we retrieved but have not used, as we have not considered it relevant information. In Twitter, as in many other social networks, location can be removed or even faked.
(ii) Tweet ID: the identification of a user on Twitter, though in this analysis we have not taken it into account as we have the username of the profile.
(iii) Time: date and time of the tweet. We used it to measure the periods of higher activity in an interval of five months. Once those maxima were located, we compared the dates to check whether those days had any correlation with some political, social, or economic ISIS event.
With this information, we built the database shown in Figure 3.
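The search for periods of higher activity described for the time field can be illustrated with a per-day tweet count. The following is a minimal sketch, not the authors' implementation; the function name `busiest_days`, the timestamp layout in `fmt`, and the sample timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def busiest_days(timestamps, top=3, fmt="%m/%d/%Y %H:%M"):
    """Count tweets per calendar day and return the `top` busiest days.

    `fmt` is an assumed timestamp layout; adjust it to the real data.
    """
    per_day = Counter(datetime.strptime(ts, fmt).date() for ts in timestamps)
    return per_day.most_common(top)


# Hypothetical timestamps, for illustration only.
sample_times = [
    "1/6/2015 21:07",
    "1/6/2015 21:27",
    "1/6/2015 22:04",
    "1/10/2015 0:05",
]
peaks = busiest_days(sample_times, top=1)
```

The dates returned this way could then be compared by hand against a calendar of known events, as the text describes.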

Information Preprocessing.
From the previous fields, we have discarded the name (which is often fake) and the location (which is often undetermined), using the rest to generate new knowledge. In particular, the following variables were introduced:
(i) Frequency: using the date field, the interval at which the user sends tweets is calculated.
(ii) Sentiment: analysis of the sentiment of the tweets of each user to polarize them into positives and negatives. For this analysis the Python library nltk, which specializes in natural language processing, and the Vader Lexicon were used.
(iii) Mentions: extraction from each tweet of the mentioned or retweeted users, using regular expressions in Python.
A new filter was applied to this new dataset in order to locate those active users who both generate tweets and are named or retweeted by other users (the users with the most impact). From this set of users, the set of connections was computed by means of a graph.
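The frequency and mention variables above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the authors' code: the regular expressions, the function names, and the choice of the mean gap in hours as the frequency value are all hypothetical.

```python
import re
from datetime import datetime

# Assumed patterns: a handle is "@" followed by word characters, and a
# retweet starts with "RT @handle".
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^RT @(\w+)")


def extract_mentions(text):
    """Return every handle mentioned in a tweet, in order of appearance."""
    return MENTION_RE.findall(text)


def retweeted_user(text):
    """Return the retweeted handle, or None if the tweet is not a retweet."""
    match = RETWEET_RE.match(text)
    return match.group(1) if match else None


def mean_interval_hours(times):
    """Mean gap in hours between consecutive tweets of one user."""
    ordered = sorted(times)
    if len(ordered) < 2:
        return None  # a single tweet has no interval
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical tweet and timestamps, for illustration only.
mentions = extract_mentions("RT @alice: hello @bob_2 and @carol!")
interval = mean_interval_hours(
    [datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 2), datetime(2015, 1, 1, 6)]
)
```

Note that such a simple pattern would also match the domain part of an e-mail address, so real data may need a stricter expression.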

Graphing the Network.
We have created several graphs based on different metrics that would lead to different interpretations. The objective was to analyze and visualize social relationships among users and communities. Once the graph was created, as the one displayed in Figure 4, we applied indicators of centrality to identify the most important vertices based on several criteria, in order to detect, in addition to the most influential users, those who receive information to broadcast it and those who follow many influential users. To do so, we have weighted every link based on the number of retweets or mentions of a user (inside the message: @user, @user2, and so on). We have not taken into account the sentiment of the message or the posting frequency; these parameters will be used later in the fuzzy clustering procedure.
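The link weighting described above, where each edge counts the retweets or mentions from one user to another, can be sketched as follows. This is a minimal illustration rather than the authors' code; the `(author, text)` input layout, the mention pattern, and the decision to skip self-mentions are assumptions.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@(\w+)")  # assumed handle pattern


def build_weighted_edges(tweets):
    """Build directed edge weights from (author, text) pairs.

    The weight of edge (author, target) is the number of times `author`
    mentioned or retweeted `target`. Self-mentions are ignored.
    """
    edges = Counter()
    for author, text in tweets:
        for target in MENTION_RE.findall(text):
            if target != author:
                edges[(author, target)] += 1
    return edges


# Hypothetical tweets, for illustration only.
sample_tweets = [
    ("a", "RT @b: some news"),
    ("a", "hi @b and @c"),
    ("b", "@a thanks"),
    ("c", "@c note to self"),
]
edges = build_weighted_edges(sample_tweets)
```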
The most influential nodes are important in graph analysis, but in social communities those users are often already detected and located. Other criteria can be more important, such as which nodes are likely to be on the most direct route between two influencer nodes, or which are key nodes to reach the rest of the nodes. That is why we have used centrality measures, to capture different properties of a network and its behaviour. The more central a node is, the more important it is. In particular, we have used two geometric measures, one path-based measure, and one spectral measure to evaluate the influence and connections of a node within its community.
(i) Degree centrality is defined as

$$C_D(v) = \deg(v),$$

the number of links incident to node $v$. This geometric measure is used to find highly connected users, those with the highest number of links to other nodes in the network. It takes into account the weighting of the edges. We are not focusing on this type of user, as this measure usually serves as a measure of popularity among nodes, though it is important for evaluating their relationship with the rest of the nodes.
(ii) Closeness is defined as

$$C_C(v) = \frac{|V| - 1}{\sum_{u \in V} d(v, u)},$$

where $|V|$ is the number of nodes and $d(v, u)$ is the length of the shortest path between nodes $v$ and $u$. This geometric measure represents the importance of each node according to how close it is to the rest of the nodes. Nodes with a high value tend to be very well connected to the most relevant nodes in their network and are perfect for broadcasting information. They do not have to be very influential but are very active followers who broadcast information on the network and are very close to the most influential nodes.
(iii) Betweenness centrality is a path-based measure defined as

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.
It indicates the nodes included in the shortest paths between most of the nodes. For this work, this was a very interesting measure because it could highlight those nodes that serve as bridges between influential nodes. In this case, these nodes might not have many followers but can connect many relevant nodes.
(iv) The last measure to be applied is the eigenvector centrality, which takes into account the number of links of a given user as well as the number of links of its connections, and so could be considered a hierarchical measure that counts not only your connections but also the connections of those whom you follow. Its value for node $v$ is given by the $v$-th element of the eigenvector $x$ related to the first eigenvalue $\lambda_1$ of the adjacency matrix $A$ of the graph [18],

$$A x = \lambda_1 x.$$

As seen in Figure 5, the results obtained applying eigenvector and betweenness are quite similar in distribution. Only measures that are not correlated are used as input for the fuzzy clustering, to avoid redundant information and an overtrained clustering that gives more relevance to some variables than to others.
We have chosen the eigenvector centrality as input for the clustering part because we are dealing with a structured problem, in which there are people who train other users to broadcast information, and so on, in a hierarchical model. The measure that best reflects this behavior is the eigenvector centrality.
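To make the centrality measures more concrete, the following sketch computes weighted degree, closeness, and eigenvector centrality on a small undirected graph using only the standard library. It is an illustration under stated assumptions, not the authors' implementation: the adjacency-dictionary layout and function names are hypothetical, closeness uses hop counts and assumes a connected graph, betweenness is omitted for brevity, and the eigenvector is approximated by power iteration on $A + I$ (which has the same eigenvectors as $A$ but avoids oscillation on bipartite graphs).

```python
from collections import deque


def weighted_degree(adj):
    """Sum of edge weights incident to each node."""
    return {v: sum(nbrs.values()) for v, nbrs in adj.items()}


def closeness(adj):
    """(n - 1) / sum of hop distances, assuming a connected graph."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (n - 1) / total if total > 0 else 0.0
    return result


def eigenvector_centrality(adj, iterations=1000, tol=1e-9):
    """Power iteration on A + I, normalised to unit Euclidean length."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: x[v] + sum(w * x[u] for u, w in adj[v].items()) for v in adj}
        norm = sum(val * val for val in new.values()) ** 0.5
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x


# Hypothetical path graph a - b - c with unit weights.
example = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1}}
deg = weighted_degree(example)
clo = closeness(example)
eig = eigenvector_centrality(example)
```

In this toy graph the middle node `b` scores highest on all three measures, which matches the intuition that it bridges the other two.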

Fuzzy Clustering.
Soft Computing techniques have been used in many different fields to deal with imprecision and uncertainty [19,20]. In our problem, segmentation techniques are unsupervised methods used to classify information in groups created from similarities among individuals. The potential of the segmentation algorithms to show underlying structures in data can be applied in different fields such as classification, image processing, pattern recognition, modeling, and identification.
Segmentation techniques can be applied to quantitative or qualitative data. In this paper only quantitative data will be used to build the data matrix, which will have records as columns and measured variables as rows. By segmenting this data, the users are grouped on the basis of their similarity, understood from a mathematical point of view and defined as the "distance" among data or according to some prototypes of the group. The grouping depends, therefore, on the individuals being grouped together and on the definition of distance.
Within these segmentations we find two approaches:
(i) Hard Clustering: each object belongs to exactly one segment, so the groups are mutually exclusive.
(ii) Fuzzy Clustering: this grouping technique, which applies fuzzy logic [21], allows objects to belong to different segments simultaneously, but with different membership degrees [22]. In many cases, this segmentation is more logical than the previous one. In our case, a user who is interested in terrorism may not have been catalogued as dangerous and yet have a high degree of membership in that group.
Therefore, fuzzy segmentation or fuzzy clustering is applied in this work, taking into account the nature of our problem, with an objective function to obtain the optimal number of partitions. This optimization requires nonlinear optimization algorithms that find a local minimum.

Fuzzy Clustering.
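The groupings in this algorithm satisfy the following conditions:

$$\mu_{ik} \in [0, 1], \quad 1 \le i \le c, \; 1 \le k \le N,$$
$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N,$$
$$0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c,$$

where $\mu_{ik}$ is the membership degree of record $k$ in group $i$, $c$ is the number of groups, and $N$ is the number of records.
The fuzzy partition space for our data is the set defined by

$$M_{fc} = \left\{ U \in \mathbb{R}^{c \times N} \;\middle|\; \mu_{ik} \in [0, 1], \; \forall i, k; \;\; \sum_{i=1}^{c} \mu_{ik} = 1, \; \forall k; \;\; 0 < \sum_{k=1}^{N} \mu_{ik} < N, \; \forall i \right\}.$$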

Fuzzy Clustering c-Means.
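This algorithm is based on the optimization of fuzzy partitions [23,24], minimizing the functional

$$J(Z; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^{m} \, \| z_k - v_i \|_A^2,$$

where $U = [\mu_{ik}] \in M_{fc}$, the fuzzy partition space, is the membership matrix of our data $Z = [z_1, z_2, \ldots, z_N]$ and $V = [v_1, v_2, \ldots, v_c]$ are the vectors characterizing the centers of these groupings for which we want to minimize our functional.
The norm between our centers and the data is given by

$$D_{ikA}^2 = \| z_k - v_i \|_A^2 = (z_k - v_i)^T A \, (z_k - v_i).$$

The parameter $m \in [1, \infty)$ determines the fuzziness of the segments. The value of the cost function $J(Z; U, V)$ can be interpreted as a measure of the deviation between the data points $z_k$ and the centers $v_i$.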
The minimization of this functional leads to a nonlinear optimization problem that can be solved through different methods, such as genetic algorithms or iterative minimization. However, the most popular method for this application in particular is Picard iteration.
The restriction of membership values is imposed by Lagrange multipliers.
We can demonstrate that to minimize the functional it is necessary that

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( D_{ikA} / D_{jkA} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \; 1 \le k \le N,$$
$$v_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^{m} z_k}{\sum_{k=1}^{N} (\mu_{ik})^{m}}, \quad 1 \le i \le c,$$

where $\mu_{ik}$ is the membership of record $z_k$ in group $i$, $v_i$ is the center of group $i$, and $D_{ikA}$ is the distance between $z_k$ and $v_i$. Therefore, the parameters to be determined in the algorithm are as follows:
(i) Number of clusters: this parameter is the most important and the one with the greatest impact on the segmentation. If the number of groups into which our data should be divided were known, this parameter would be determined. We will determine the number of clusters through the fuzzy partition coefficient (FPC). This validation measure indicates how well our data are explained by this grouping; that is, a high value means that the membership of our data to each one of the segments is, in general, strong and not fuzzy.
(ii) Fuzziness parameter: the parameter $m$ significantly affects the fuzziness of the segmentation. As it approaches 1, the grouping ceases to be fuzzy and becomes hard, and if it tends to $\infty$ it will be completely fuzzy. We have chosen the value $m = 2$, as it is the most used in the literature.
(iii) Termination criterion: as it is an iterative algorithm, it is necessary to establish a termination criterion to stop the iterations. In our case, we stop after 1000 iterations or when the error falls below 0.005.
(iv) Distance matrix: the calculation of the distance implies establishing the scalar product matrix $A$. The natural choice is the identity matrix ($A = I$), but a widely used distance matrix is the inverse of the covariance matrix of the data, leading to the Mahalanobis norm.
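The alternating update of centers and memberships, with the parameter values stated above ($m = 2$, at most 1000 iterations, error below 0.005), can be sketched in plain Python. This is a minimal illustration with the Euclidean norm ($A = I$), not the authors' implementation; the function name, the random initialization with a fixed `seed`, and the use of the largest membership change as the "error" are assumptions.

```python
import random


def fuzzy_c_means(data, c, m=2.0, max_iter=1000, error=0.005, seed=0):
    """Euclidean fuzzy c-means by alternating (Picard) iteration.

    data: list of equal-length numeric tuples; requires m > 1.
    Returns (centers, u), where u[i][k] is the membership of record k
    in cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships, normalised so each record sums to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= col
    centers = []
    for _ in range(max_iter):
        # Center update: mean of the data weighted by u ** m.
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[k] * data[k][d] for k in range(n)) / total
                for d in range(dim)
            ))
        # Membership update from squared distances to every center.
        new_u = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d2 = [
                max(sum((data[k][d] - centers[i][d]) ** 2 for d in range(dim)), 1e-12)
                for i in range(c)
            ]
            inv = [dist ** (-1.0 / (m - 1.0)) for dist in d2]
            s = sum(inv)
            for i in range(c):
                new_u[i][k] = inv[i] / s
        change = max(abs(new_u[i][k] - u[i][k]) for i in range(c) for k in range(n))
        u = new_u
        if change < error:  # assumed meaning of the "error" criterion
            break
    return centers, u


# Hypothetical, well-separated 2-D records, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, u = fuzzy_c_means(points, c=2)
```

Each column of `u` sums to 1, so a record lying between the two groups would receive memberships close to 0.5 in both, which is the property the methodology relies on to flag undefined profiles.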
The norm used influences the segmentation criterion by changing the measure of dissimilarity. The Euclidean norm leads to hyperspherical groupings in the coordinate axes, while the Mahalanobis norm leads to hyperellipsoidal groupings in the axes given by the covariances between variables. In addition to these parameters, several modifications of the algorithm can be found in the literature:
(i) Modifications that use an adaptive distance measure, such as the Gustafson-Kessel algorithm [25] and fuzzy maximum likelihood estimation [26].
(ii) Algorithms that relax the condition on the probability of belonging to each segment ($\sum_{i=1}^{c} \mu_{ik} = 1$, $1 \le k \le N$), indicating a level between each one of the groups.
In this work, we have checked the Euclidean norm, Mahalanobis, and Gustafson-Kessel.
The Gustafson-Kessel algorithm extends the method with an adaptive distance to detect groupings with different geometrical shapes. Each segment has its own distance, given by

$$D_{ikA_i}^2 = (z_k - v_i)^T A_i \, (z_k - v_i).$$

The matrices $A_i$ become variables that are optimized within the functional, so each group will have the distance that minimizes its value. The only restriction imposed is that the determinant is fixed to a positive value ($|A_i| = \rho_i$, $\rho_i > 0$, $\forall i$). Optimizing using the Lagrange multipliers method, we obtain that the distance matrices must fulfill

$$A_i = \left[ \rho_i \det(F_i) \right]^{1/n} F_i^{-1},$$

where $F_i$ is the fuzzy covariance matrix of each one of the segments and $n$ is the number of variables.
The parameters of this algorithm are, in addition to the general parameters of segmentation, the volumes of the groups $\rho_i$. If we have no knowledge about this value, we set $\rho_i = 1$.
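As an illustration of the adaptive distance, the sketch below builds the Gustafson-Kessel norm matrix for a single cluster in two dimensions. It is a hedged example, not the authors' code: it assumes the usual weighted form of the fuzzy covariance matrix (weights $\mu^m$), restricts itself to $n = 2$ so that the inverse and determinant can be written by hand, and uses hypothetical function names and data.

```python
def fuzzy_covariance_2d(data, memberships, center, m=2.0):
    """Fuzzy covariance of 2-D points, weighted by membership ** m."""
    weights = [mu ** m for mu in memberships]
    total = sum(weights)
    fxx = fxy = fyy = 0.0
    for (x, y), w in zip(data, weights):
        dx, dy = x - center[0], y - center[1]
        fxx += w * dx * dx
        fxy += w * dx * dy
        fyy += w * dy * dy
    return ((fxx / total, fxy / total), (fxy / total, fyy / total))


def gk_norm_matrix_2d(cov, rho=1.0):
    """A = (rho * det(F)) ** (1/2) * inverse(F) for a 2 x 2 matrix F."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    scale = (rho * det) ** 0.5 / det  # volume factor times 1/det of the inverse
    return ((scale * d, -scale * b), (-scale * b, scale * a))


def gk_distance_sq(z, center, norm):
    """Squared distance (z - v)^T A (z - v) in two dimensions."""
    dx, dy = z[0] - center[0], z[1] - center[1]
    return (dx * (norm[0][0] * dx + norm[0][1] * dy)
            + dy * (norm[1][0] * dx + norm[1][1] * dy))


# Hypothetical cluster stretched along the x axis, all memberships 1.
cluster = [(-2, 0), (2, 0), (0, -1), (0, 1)]
cov = fuzzy_covariance_2d(cluster, [1.0] * 4, center=(0, 0))
norm = gk_norm_matrix_2d(cov, rho=1.0)
```

With this matrix, the points `(2, 0)` and `(0, 1)` are equally far from the center, which shows how the norm adapts to the elongated shape of the group while keeping its determinant equal to `rho`.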
We have tested several measures to check which one fits best to segment our dataset [27].

Results and Discussion
Once the previous methodology was defined and programmed, the Kaggle dataset with the registered pro-ISIS Twitter users was used first. As the criterion to choose which distance matrix and number of segments to use, we selected the configuration that gives a high membership degree for most of the data. For this dataset, the maximum value was obtained using the Gustafson-Kessel algorithm with two segments, as seen in Figure 6.
With this criterion, the division has been performed according to the variables mentioned (frequency, sentiment, and eigenvector centrality); coherently with the rest of the variables, it is possible to identify a more dangerous user group (red) and a less active group (blue).
However, one of the main advantages of this methodology is that we can identify users that have been assigned to one group rather than another in spite of having a low membership degree. Figure 6 shows the membership degree to both groups, and the marked zone is the zone of users to be analyzed in detail. For example, with this same dataset, establishing a fuzzy membership threshold of 35-65%, we would obtain 2 doubtful users among 74 profiles. These 2 users are the focus of our work and should be studied individually and over time, to check whether they remain in the same place or have turned their behaviour to a more radical one, as seen in Figures 7 and 8.
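The selection of doubtful users by a membership band can be written in a few lines. The following is a minimal sketch assuming two groups and a dictionary of membership degrees per user; the function name and the sample values are hypothetical.

```python
def doubtful_users(memberships, low=0.35, high=0.65):
    """Return users whose membership degree lies inside [low, high].

    memberships: dict mapping a user to its membership degree in one of
    two groups (the other group's degree is 1 minus this value).
    """
    return sorted(user for user, mu in memberships.items() if low <= mu <= high)


# Hypothetical membership degrees, for illustration only.
sample_memberships = {"user_a": 0.97, "user_b": 0.52, "user_c": 0.08, "user_d": 0.40}
flagged = doubtful_users(sample_memberships)
```

Because the two degrees sum to 1, a band that is symmetric around 0.5 flags the same users whichever group is used as reference.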
To verify the consistency of our methodology, we performed a new experiment with an expanded dataset. To the profiles identified as "fan boys", we added other profiles considered of interest because of their connections with the users of the first set.
When downloading Twitter information, we have discarded the number of followers (as we have the information on the users who mention or retweet and who are mentioned or retweeted) and the number of publications (as we have the number of tweets published in the sampling).
After filtering users by those who generate content and those who receive it, we programmed a weighted graph to obtain centrality measures in our network.
The best segmentation value was obtained using the Mahalanobis distance with two clusters, giving an FPC of 0.85, as we can see in Figure 9, which represents the FPC obtained as a function of the number of fuzzy clusters.
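The fuzzy partition coefficient used to compare cluster counts is the mean of the squared memberships over all records, so it equals 1 for a crisp partition and $1/c$ for a completely fuzzy one. The sketch below assumes a $c \times N$ membership matrix stored as nested lists; the helper `best_cluster_count` and the sample matrices are hypothetical.

```python
def fuzzy_partition_coefficient(u):
    """FPC = (1/N) * sum over i and k of u[i][k] ** 2."""
    n = len(u[0])
    return sum(mu * mu for row in u for mu in row) / n


def best_cluster_count(memberships_by_c):
    """Return the cluster count whose membership matrix has the highest FPC."""
    return max(
        memberships_by_c,
        key=lambda c: fuzzy_partition_coefficient(memberships_by_c[c]),
    )


# Hypothetical membership matrices, for illustration only.
crisp = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
three_way = [[1 / 3, 1 / 3], [1 / 3, 1 / 3], [1 / 3, 1 / 3]]
chosen = best_cluster_count({2: crisp, 3: three_way})
```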
To clarify the results, Figure 10 represents the segmentation in terms of the variables that we have used to calculate it, such as frequency, eigenvector centrality, and the number of times that a user has been mentioned. For this experiment, if we set a fuzzy membership threshold of 0.45, we find 123 users to be studied out of the 3395 in total. As in the previous case, these users, represented by the blue box in Figure 11, are candidates to be studied and monitored to trace their behavior over time.
On the other hand, after identifying the most active users, we checked how many of the users categorized as active in the first dataset were also categorized as active in this second set (blue group). Of the 74 users of the first categorization, 59 were correctly grouped, and the fuzzy membership of the remaining 15 users is represented in the boxplot of Figure 12. The distribution of the membership shows that more than 25% of these false positive and false negative users had a very weak membership (between 0.5 and 0.6), which means that these users should be traced to check their behavior using our methodology.

Conclusions
The use of social networks as a means to broadcast information has become a popular way to attract new followers to terrorism in general and Jihadism in particular. In this work, we have developed a methodology to identify potentially dangerous users that remain partially hidden, separated from those that are best known for being very active. In fact, the terrorist attack in Barcelona in 2017 was organized by terrorists whose profiles had not been classified as dangerous. Detecting and monitoring those new profiles can be crucial to prevent or predict future terrorist actions and terrorist recruiting.
The presented methodology consists of defining a dataset of users plus several metrics to locate influential users. In addition, other metrics are obtained, such as the frequency or the sentiment of their tweets. These metrics are used as vectors to perform the segmentation of the data. For this type of procedure, it is advisable to use Soft Computing techniques that deal with the imprecision of the information. In the present problem we have used fuzzy clustering techniques to point out those users that are likely to become more active in the future and, in consequence, should be followed over time to check their behavior. Moreover, the proposed analysis techniques involve unsupervised algorithms, so they can be applied continuously; this same methodology could therefore be used to monitor users marked as fuzzy.
As for future work, we would like to expand this methodology to other environments where user profiles could have similar patterns, like pedophilia or fake news. Fake news is known to have an important economic impact if it damages the image of a company and an important political impact if it can manipulate a significant number of voters.

Data Availability
The Kaggle dataset used to support the findings of this study has been deposited in the Kaggle repository under the name How ISIS Uses Twitter: https://www.kaggle.com/fifthtribe/how-isis-uses-twitter.