Rumour Source Detection Using Game Theory

Social networks have become a critical part of our lives as they enable us to interact with a large number of people. These networks have become the main sources for creating, sharing and extracting information on various subjects. However, not all of this information is true; it may contain unverified rumours that can spread incorrect information to the masses and may even lead to widespread panic. Thus, it is of great importance to identify the nodes and edges that play a crucial role in a network in order to find the most influential sources of rumour spreading. The basic idea is to identify the nodes and edges with the highest criticality in a network. Most existing work on this problem uses simple centrality measures, which focus on the individual contribution of a node in a network. Game-theoretic approaches such as Shapley Value (SV) algorithms instead measure the marginal contribution of a given player as the weighted average marginal increase in the yield of any coalition that this player might join. In our experiments, we played five SV-based games to find the top 10 most influential nodes on three network datasets (Enron, USAir97 and Les Misérables). We compared our results with those obtained using primitive centrality measures. Our results show that the SV-based approach is better at capturing the marginal contribution, and therefore the actual influence, of each node on the entire network.


I. Introduction
Rumour Source Detection (RSD) aims to identify the most powerful nodes that are the primary sources of rumour propagation within a network. With the development of science and technology, social networking has become a modern tool for people to connect and spread news. Diffusion of information in a social network can occur at lightning speed and, more often than not, this is considered a boon when relevant and correct information is being spread. At the same time, these networks can also be used to spread false or unverified information, either deliberately or by mistake. Rumours therefore spread quickly and widely, and they can be highly destructive. It is of great theoretical and practical importance to decide whether there is an influential spreader, and to recognize who that spreader is, in order to prevent and control rumour propagation. This task is considered challenging because of the high speed of information diffusion and the continuously evolving, dynamic nature of these social networks.
The most common approaches used in the past to find the most influential node include single centrality [1] and group centrality [2] measures. The four major centrality measures are as follows. First, Degree Centrality (DC) refers to the number of associations that a node has with other nodes in a network. For an undirected graph, it is equal to the number of nodes to which a node is directly connected. For a directed graph, we need to compute the in-degree as well as the out-degree of each node. Second, Eigenvector Centrality (EVC) considers the relative power or significance of the nodes. Here, each node is assigned a value representing its relative significance, on the basis that nodes connected to high-power nodes have a stronger influence over the network than those connected to low-power nodes. Third, Betweenness Centrality (BC) measures how strongly pairs of nodes are connected via a given node. It is computed as the ratio of the number of shortest paths between two nodes that pass through the given node to the total number of shortest paths between those two nodes, summed over all pairs of nodes. Finally, Closeness Centrality (CC) measures how quickly a rumour can spread from one node to all the other nodes in a network. It is measured as the inverse of the sum of the shortest path distances between a given node and all other nodes in the network. For more insight into centrality measures along with mathematical derivations, refer to [3].
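As a rough illustration of the four measures described above, the following sketch ranks the nodes of a small graph using the networkx library in Python. The toy edge list and the helper name `top_node_per_measure` are assumptions made for this example; they are not taken from the datasets or code used in this work.

```python
import networkx as nx


def top_node_per_measure(G):
    """Return the highest-scoring node under each of the four centrality measures."""
    measures = {
        "DC": nx.degree_centrality(G),
        "EVC": nx.eigenvector_centrality(G, max_iter=1000),
        "BC": nx.betweenness_centrality(G),
        "CC": nx.closeness_centrality(G),
    }
    return {name: max(scores, key=scores.get) for name, scores in measures.items()}


# Hypothetical toy network: a triangle (a, b, c) with a short chain (a - d - e).
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("d", "e")])
top = top_node_per_measure(G)
print(top)
```

On this toy graph all four measures agree on node `a`; on real networks the rankings generally differ, which is what motivates comparing them.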
However, these measures have several disadvantages, as they are based on different concepts and emphasize different topological properties of the network. For instance, DC gives the same weight to all the neighbours of a node when computing its importance. It would be more intuitive to give higher weights to neighbours that are themselves important. In EVC, most of the weight gets concentrated in a relatively small subgraph and therefore, not all nodes are quantified as they should be [4]. The remaining measures do not tend to capture the flow of information in the graph. Moreover, single centrality measures suffer from an inherent disadvantage: they fail to recognize how a node functions when it is considered as part of a group. Group centrality measures were created to overcome this barrier; they focus on how groups of nodes operate rather than on individual functionality. Nonetheless, group centrality also suffers from a drawback, as it focuses on node groups determined a priori and leads to confusion when prioritizing individual nodes within the network.
We aimed to work on game-theoretic algorithms to explore different strategies and metrics for assessing the root cause of the rumour spread. Game theory is a significant paradigm that finds applications in various fields. It is used in statistics and business analytics for modelling the interaction among participating agents [5]. Game theory helps to sharpen intuition by allowing a logical analysis of various ideas, and it can be applied in tandem with decision theory. Game theory has also been widely used in the field of natural language processing. One of its most prominent applications is finding the most influential node within a network, which is relevant to our problem statement. This approach also does not face any of the above-mentioned disadvantages. Typical social network analysis cannot capture the dynamics of strategic interactions among the individuals in the network. Our proposed model is based on cooperative game theory, which addresses this issue [6]. The elementary constituents of intricate interactions in a network can be efficiently processed using a rich class of games, called influence games, as has been demonstrated in [7].
The Shapley Value (SV) algorithm is a game-theoretic approach that has been explored in the past for finding the most influential nodes in a graphical network [8], [9], [10], but not for the RSD problem specifically. The strategic issues in the Gale-Shapley model and its applications have been discussed in [11]. An important solution concept was proposed on the basis of marginal contribution. The SV of player i, denoted by $SV_i(v)$, is equal to the weighted mean of i's marginal contributions to each coalition C to which the player may belong:
$$SV_i(v) = \frac{1}{n!} \sum_{\pi \in \Pi(I)} \left[ v(C_\pi(i) \cup \{i\}) - v(C_\pi(i)) \right] \qquad (1)$$
In (1), n is the total number of players, $\Pi(I)$ is the set of all permutations of the n players, $v$ is the characteristic function that assigns a value to each coalition, and $C_\pi(i)$ is the coalition of players that precede i in the permutation $\pi$. This concept is based on cooperative game theory, an aspect of game theory which encourages players to form coalitions to maximize their yield in the game. Coalitions are groups of players that form the fundamental units of decision making. They are assumed to uphold cooperative conduct, which makes it reasonable to view these games as a contest between alliances of participants and not between separate players. The core assumption here is that, as the game proceeds, a grand coalition comprising all participants will eventually form. The theory of cooperative games provides a high-level approach, as it describes only the structure, strategies and benefits of the coalitions. More insight into the SV algorithm and its derivation can be found in [8].
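To make the definition of the SV concrete, the following minimal sketch computes exact Shapley values by averaging each player's marginal contribution over every ordering of the players. The three-player "glove game" (two left gloves, one right glove, where a coalition is worth the number of matched pairs it holds) and the function names are assumptions chosen only for illustration; they are not part of the experiments in this work. Enumerating all orderings is feasible only for very small games.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values obtained by enumerating every ordering of the players."""
    totals = {p: Fraction(0) for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p to the players that precede it.
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    n_fact = factorial(len(players))
    return {p: total / n_fact for p, total in totals.items()}


def glove_game(coalition):
    """Hypothetical characteristic function: number of left/right glove pairs."""
    left = sum(1 for p in coalition if p.startswith("L"))
    right = sum(1 for p in coalition if p.startswith("R"))
    return min(left, right)


sv = shapley_values(["L1", "L2", "R"], glove_game)
print(sv)
```

The holder of the scarce right glove receives the largest share, and the values sum to the worth of the coalition of all players.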
We have used an SV-based centrality algorithm built on the key idea of game-theoretic network centrality: defining a cooperative game over a network in which agents are nodes, coalitions are groups of nodes, and coalition payoffs are defined to meet the requirements of a given application. The main contribution of our work is that we have explored the power of five different variations of the SV algorithm on various social networks that can be used for the purpose of spreading rumours.
We also used the main centrality measures to identify the top-k most prominent nodes, in order to draw a clear and detailed contrast between our game-theoretic approach and the primitive centrality measures. This comparison helped to illustrate the characteristics and accuracy of the game-theoretic algorithm.
Section II gives a detailed study of various works done in the related field. Section III explains the datasets used and the algorithmic flow. Section IV describes the results obtained and the evaluations performed. Section V discusses the results and gives a theoretical explanation for the obtained results. Section VI concludes the research work with an insight into its future scope.

II. Related Work
Connectivity is one of the fundamental research topics in the literature on network analysis. Domingos and Richardson [12] were the first to experiment with detecting the top-k nodes. They developed an algorithmic model to address this problem by modelling social networks as Markov random fields, which mathematically characterize the probability of occurrence of an event.
Chen and Teng [1] explained that single node centrality measures are suitable for assessing individual influence in isolation, while Shapley centrality assesses the performance of individuals in group influence settings. Wei et al. [2] explored the need to learn a distributed vector representation for each vertex in a network. They laid emphasis on node classification and link prediction. An interesting approach to discovering influential nodes in a network by formulating a target set selection problem has been discussed in [9]. Here, the problem comprises two main steps: the first step deals with finding a set of 'k' key nodes that would maximize the number of nodes being influenced in the network, while the second step is based on the λ-coverage problem.
We further investigated various kinds of centrality measures used for finding the most influential nodes in a network. DC, discussed by Gao et al. [13], is used to efficiently measure the significance of nodes. However, it suffers from a severe disadvantage: it does not take into consideration the overall, detailed structure of the network. EVC, according to Stephenson and Zelen [14], overcomes the defects associated with DC. It takes into account the influence of the neighbours of the node under consideration. BC, as explored by Freeman [15], learns topology-related data of networks in advance. Al-Garadi et al. [16] describe how CC can be efficiently used to identify multiple influential spreaders. We also investigated the disadvantages associated with using centrality measures to find the most influential node in [1], [17], [18], which have been discussed in Section I.
An attempt has been made to find the most influential node in a network using mapping entropy (ME), which reflects the correlation between a node and its neighbours [18]. We particularly inspected the application of ME to the Enron email dataset, which is commonly used for the study of social networks [19]. ME recognizes the significance of a node in a complex network based on the degree of the node and the degrees of its neighbours. As a technique for network attack, it helps to identify the node to attack, thereby saving valuable resources. However, the proposed game-theoretic approach is able to capture the dynamics of strategic interactions in a network, not only with immediate neighbours, but also with a larger subset of relationships in the graph. Thus, we chose an SV-based algorithm to find the most influential nodes in a cooperative game.
Previous research by Tan et al. [20] on rumour spreading focused primarily on viral epidemics in communities. The standard model for viral epidemics is the susceptible-infected-recovered (SIR) model. There are three types of nodes in a typical rumour propagation model: i) susceptible nodes capable of being infected, ii) infected nodes capable of further spreading the virus, and iii) recovered nodes that are healed and no longer capable of infection. Various methods have been proposed to identify the most influential spreaders of rumour, including the weighted k-core decomposition method [15] and rumour centrality with a mass centre technique [20]. An advanced form of this model, called the SEIR model, was also studied. Zhou et al. [21] considered the graph topology and observed snapshots of a network to identify the single rumour source by assigning the nodes in a network to four possible states: susceptible (S), exposed (E), infected (I), and recovered (R).
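The three node states of the SIR model described above can be sketched as a simple discrete-time simulation on a graph. This is only an illustrative sketch under stated assumptions: the synchronous update rule, the parameter names `beta` (per-contact infection probability) and `gamma` (per-step recovery probability), and the toy path graph are all hypothetical and are not taken from the cited works.

```python
import random


def simulate_sir(adj, source, beta, gamma, steps, seed=0):
    """Run a discrete-time SIR process and return the final state of every node."""
    rng = random.Random(seed)
    state = {v: "S" for v in adj}
    state[source] = "I"
    for _ in range(steps):
        infected = [v for v in adj if state[v] == "I"]
        newly_infected = set()
        # Each infected node tries to infect each susceptible neighbour.
        for v in infected:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)
        # Nodes that were infected at the start of the step may recover.
        for v in infected:
            if rng.random() < gamma:
                state[v] = "R"
        for u in newly_infected:
            state[u] = "I"
    return state


# Hypothetical path graph a - b - c - d, with the rumour starting at node a.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_state = simulate_sir(path, source="a", beta=1.0, gamma=1.0, steps=2)
print(final_state)
```

With `beta` and `gamma` both set to 1.0 the process is deterministic, which makes the sketch easy to check by hand; intermediate values give a stochastic spread controlled by `seed`.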
We studied the Explosion-Trust (ET) Game Model described in [22]. It explains how a rumour spreading model can be constructed using game theory by considering two significant factors: the rumour explosion degree and the trust degree of the source node. In [23], a Belief-Propagation-based (BP) algorithm has been discussed that computes the joint likelihood function of the source location and the spreading time for the general continuous-time case in order to detect the rumour source in a network. In [24], the concept of influence maximization has been explained from a game-theoretic perspective. A Coordination Game (CG) model, in which every individual node makes its decision based on the benefit of coordination with its network neighbours, has been proposed. SV and other game-theoretic solution concepts can be applied to other network-related issues as well, for example, to the cost allocation problem in the electric market transmission system; for each application, the mathematical aspects of the problem should then be addressed.
The original SV algorithms that have been implemented using Monte-Carlo simulations in the past are computationally expensive and may not arrive at an exact answer. Michalak et al. [8] developed analytical formulas that avoid these simulations and run in polynomial time. They discuss five characteristic functions, each of which tries to convey a certain centrality concept. We have taken inspiration from their work and worked with five SV games that focus upon one characteristic function each. Furthermore, to compare our work with existing literature, we have taken the works of Qiao et al. [25], Hardin et al. [26] and Munjal et al. [27]. We found very few works that list the top 10 most influential nodes, obtained with the help of primitive centrality measures, on one of the datasets that we used in our study. Hence, we have used these three works for our comparative study. Qiao et al. [25] explored an entropy-based centrality measure along with the primitive centrality measures and tested it on the USAir97 dataset [28]. Hardin et al. [26] studied the relationships in the Enron dataset [29], [30] using six centrality measures. Finally, Munjal et al. [27] found the most influential nodes from the Les Misérables dataset [31]. We have performed our experiments on these three datasets and compared the top 10 most influential nodes obtained by using our five SV games with the top nodes listed in these works. More details about the datasets used are given in Section III.A.

III. Proposed Method For RSD
Section III.A explains the datasets used and their importance. Section III.B explains the algorithmic flow used in detail.

A. Datasets
This section describes the datasets that have been used for our implementation. For our experiments, we required undirected graphs with positive edge weights that could be interpreted as social networks and whose top 10 most influential nodes were already known (so that we could compare our results with these known influential nodes). We have used three major datasets which satisfy these criteria; they are described below.

Unweighted Graph
An unweighted graph can be technically defined as a graph G(N, E) having 'n' nodes represented by set N and 'e' edges represented by set E consisting of unordered pairs, such that $(n_1, n_2) = (n_2, n_1)$, $(n_1, n_2) \in E$ and $n_1, n_2 \in N$. Games 1 and 2 are played by creating an unweighted network from the datasets.

Weighted Graph
A weighted graph can be technically defined as a graph G(N, E) having 'n' nodes represented by set N and 'e' edges represented by set E, with $(n_1, n_2) \in E$ and $n_1, n_2 \in N$, in which every edge $(n_1, n_2)$ additionally carries a positive weight $w(n_1, n_2)$. Since the networks used here are undirected, the edges remain unordered pairs, so that $w(n_1, n_2) = w(n_2, n_1)$. Games 3-5 are played by creating a weighted network from the datasets.

Enron Dataset
The CALO Project (A Cognitive Assistant that Learns and Organizes) collected and prepared this dataset [29], [30]. It contains data from about 150 users of the Enron organization, mainly senior executives, grouped into files. There are a total of about 0.5 M messages in the corpus. We used a subset of this dataset, containing 143 nodes (people from the Enron organization) and 1800 edges (an edge exists between two people if they have communicated with each other via email). Edges are weighted by the frequency of email exchanges between two users. This dataset can act as a social network through which rumours can spread among the members of the organization. Hence, we can identify the important nodes and assign them labels that symbolize their relative network value. This dataset has been commonly used for the study of social networks as well as for finding the most influential nodes [19], [26], [32], and so we have compared the results of our algorithm with another study that used the same dataset [26].

Les Misérables
This is a co-occurrence graph for the characters that appear in the novel 'Les Misérables' by Victor Hugo [31]. This dataset consists of 77 nodes and 254 edges where a node represents a character and an edge between two nodes shows that these two characters appear in the same chapter of the book. The weight of each link indicates how often such a co-appearance occurs. This dataset too can act as a social network for the spread of a rumour. We have compared the results obtained by our SV-based approach with those obtained by various centrality measures used in [27].

USAir97
The USAir97 dataset [28] has been transformed into an undirected network comprising 332 nodes, where each node represents an airport, and 2126 edges, where each edge reflects a direct air route between two American airports. Here, weights represent the normalized distance between two airports. This dataset is not particularly suited to the study of rumour spreading, but owing to the lack of datasets whose most influential nodes are known, we have included it to compare the results of our approach with the most influential nodes obtained by various centrality measures, as in [25].

B. Algorithm
Focusing on the Shapley algorithm from game theory, we referred to the algorithms described in the work of Michalak and Szczepański [8]. That work establishes exact analytical formulae for SV-based centrality in both weighted and unweighted networks and develops polynomial-time algorithms for computing it.

Creation of Weighted and Un-Weighted Network Graphs
Graphs were created by using the networkx library in Python for all three datasets. Games 1 and 2 require unweighted graphs whereas the remaining games require weighted graphs.
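A minimal sketch of this step with networkx is given below. The record format `(node_u, node_v, weight)`, the sample records and the helper name `build_graphs` are assumptions made for illustration; the actual loading code and file formats of the three datasets are not described in the text.

```python
import networkx as nx


def build_graphs(records):
    """Build an unweighted and a weighted undirected graph from (u, v, weight) records."""
    unweighted = nx.Graph()
    unweighted.add_edges_from((u, v) for u, v, _ in records)

    weighted = nx.Graph()
    weighted.add_weighted_edges_from(records)  # stores each weight under the "weight" key
    return unweighted, weighted


# Hypothetical records, e.g. (sender, receiver, number of emails exchanged).
records = [
    ("alice", "bob", 12),
    ("bob", "carol", 3),
    ("alice", "carol", 1),
    ("carol", "dave", 7),
]
unweighted, weighted = build_graphs(records)
print(unweighted.number_of_nodes(), unweighted.number_of_edges())
print(weighted["alice"]["bob"]["weight"])
```

Keeping the two graphs separate makes it explicit which games ignore the edge weights and which ones use them.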

Coalition Games Based on Shapley Algorithm
The SV of a node is its average marginal contribution to the value of the characteristic function across all potential coalitions. We applied the Shapley algorithm to find the top-k nodes that are likely to be the most prominent in the network.
Specifically, we concentrated on five underlying network-defined coalition games that vary in degree and centrality of the network. Each game has a certain characteristic function v(C) which represents how prominent a particular node is to a given coalition C.
For more insight into the working of these games and their underlying mathematics, refer to [8]. The game descriptions are as follows:
a) Game 1: In this game, we considered all the permutations of all the nodes that are immediately reachable, by one hop, from the node $n_i \in N(G)$. Let each random permutation be denoted by P, the neighbours of node $n_i$ in the graph G(N, E) be denoted by $n_i$.neighbours, and the degree of node $n_i$ be denoted by $n_i$.degree. Algorithm 1 describes the procedure involved in SV calculation.
b) Game 2: A rumour source will, more often than not, affect farther nodes. To take relationships with farther nodes into account and to generalise the game, we introduced a value, p, denoting the number of agents in a coalition that a node is adjacent to. In this game, a node is considered 'influenced' if at least p of its neighbours are influenced. We divided the analysis using this game into two parts: first, where the degree of the node is less than p, and second, where the degree is more than p. Algorithm 2 describes the procedure involved in SV calculation.
c) Game 3: In this game, we introduced the concept of weighted graph networks. This game is an extension of game 1; it uses Dijkstra's algorithm to compute the distance between two nodes. The cutoff value, d, is the maximum permissible distance of a node from any member of a given coalition. The extended degree is defined as the size of the set of all nodes that are at most distance d away from the node $n_i$. Algorithm 3 describes the procedure involved in SV calculation.
d) Game 4: Here we worked with the assumption that a node closer to a coalition will have a greater effect on it than some other node farther away, even if both nodes satisfy the cutoff criterion of game 3. For this purpose, we introduced a positive-valued decreasing function f(x). The value f(d) scales the contribution to the SV of a coalition that is d units away from a node. The marginal contribution of each node $n_i$ through every other node $n_j$ ($j \neq i$), for each coalition, gives the SV, as shown in Algorithm 4.
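For the one-hop setting of Game 1, the SV can be sketched in a few lines. The sketch below assumes, following our reading of the formulation in [8], that a coalition is worth the number of nodes that belong to it or are adjacent to it; under that assumption the SV of a node is the sum of 1/(1 + degree) over the node itself and its neighbours. The toy adjacency dictionary and the function names are hypothetical, and the brute-force routine is included only to check the closed form on a small graph.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def game1_shapley(adj):
    """Closed form: sum of 1 / (1 + deg(u)) over u in {v} and the neighbours of v."""
    return {
        v: sum(Fraction(1, 1 + len(adj[u])) for u in {v} | adj[v])
        for v in adj
    }


def fringe_value(adj, coalition):
    """Assumed characteristic function: nodes in the coalition or adjacent to it."""
    covered = set(coalition)
    for v in coalition:
        covered |= adj[v]
    return len(covered)


def brute_force_shapley(adj):
    """Average marginal contribution over all orderings (small graphs only)."""
    nodes = list(adj)
    totals = {v: Fraction(0) for v in nodes}
    for order in permutations(nodes):
        coalition = []
        for v in order:
            before = fringe_value(adj, coalition)
            coalition.append(v)
            totals[v] += fringe_value(adj, coalition) - before
    return {v: t / factorial(len(nodes)) for v, t in totals.items()}


# Hypothetical toy network: a triangle (a, b, c) with a short chain (a - d - e).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
closed = game1_shapley(adj)
top = sorted(closed, key=closed.get, reverse=True)
print(top[:3])
```

The closed form needs one pass over each node's neighbourhood, whereas the brute-force check grows factorially with the number of nodes, which is why analytical formulas matter for networks of realistic size.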

Estimating Centrality Measures
After working on the five coalition games, we introduced multiple centrality measures to determine the network's most powerful node with the highest scope or effect. To generate an elaborate comparison, various network centrality measures such as DC, EVC, BC, CC, have been used.

IV. Results
We experimented on three real-world network datasets, namely the USAir97 dataset [28], the Enron email dataset [29], [30] and the Les Misérables dataset [31], and then compared the results of the five coalition games defined previously with the results obtained using the four aforementioned centrality measures. Qiao et al. [25] have applied these centrality algorithms to the USAir97 network to assess the performance of their network centrality model. For the USAir97 dataset, Table I shows the comparison between the top-k (k=10) nodes identified by our model for all five coalition games and those identified by the various centrality models employed in [25]. Similarly, for the Les Misérables dataset, Table II shows the comparison between the top-k (k=10) nodes identified by our model for all five coalition games and those identified by the various centrality models employed in [27].
We observed that the numbers of common items between the top-10 nodes found using coalitional game 1 and those found using the DC, BC, and CC measures are nine, nine and four, respectively. The most significant observation is that the top-10 nodes are the same for coalitional game 1 and the EVC measure. For the Les Misérables dataset, we observed that node 11 was recognized as the most influential node in all five coalitional games and also by the DC, BC, and CC measures. We noticed an overlap of six nodes in the observations of game 3, game 5 and the CC measure. Similarly, we referred to the work of Hardin, Sarkis and Urc [26] to compare the efficiency of our model using the Enron email dataset. Table III shows the results obtained for the same.
We observed that Phillip K. Allen, the Managing Director of Trading, appeared in the results of all the coalition games. Mike Grigsby, VP of Trading, is also an important figure who is present in the results of three of the five games. Barry Tycholiz, also a VP of Trading, was found in the results of four coalition games. Another person who can be identified as a prominent figure is Jeff Dasovich, Director for State Government. Game 5 recognizes Louise Kitchen, the president of Enron, as one of the most significant nodes.
To get a better numerical understanding of our results, we used a comparison metric, the Jaccard Index, also known as Intersection over Union or the Jaccard Similarity Coefficient, which is used to calculate the similarity and diversity of sample sets.
The Jaccard coefficient measures the similarity between two finite sample sets A and B and is defined as the size of their intersection divided by the size of their union, as shown in (2).
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$
Using this intersection similarity, we compared the most significant nodes from each coalition game with those identified by each centrality measure. Finally, for a holistic comparison, we took the mean over all intersections, as shown in (3).
$$I_{centrality} = \frac{1}{5} \sum_{j=1}^{5} i_j \qquad (3)$$
$I_{centrality}$ denotes the mean of all intersections between sets over the five coalition games, where centrality denotes the centrality model used, and $i_j$ represents the intersection similarity between that centrality measure and game j. The results are displayed in Table IV.
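The comparison described by (2) and (3) can be sketched as follows. The node identifiers in the example are hypothetical and do not reproduce the values in Tables I to IV; the function names are likewise chosen only for this illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


def mean_jaccard(centrality_top_k, games_top_k):
    """Mean Jaccard index between one centrality ranking and every game's ranking."""
    scores = [jaccard(centrality_top_k, game) for game in games_top_k]
    return sum(scores) / len(scores)


# Hypothetical top-3 node sets for one centrality measure and two games.
centrality_nodes = [1, 2, 3]
game_nodes = [[1, 2, 3], [2, 3, 4]]
print(jaccard(centrality_nodes, game_nodes[1]))
print(mean_jaccard(centrality_nodes, game_nodes))
```

Because both sets have the same size k, sharing m nodes gives a Jaccard index of m / (2k - m); for example, nine shared nodes out of ten correspond to 9/11.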

V. Discussion
We observed many disadvantages in the primitive centrality measures that have been used in the past for finding the most influential node, including an excessive focus on the individual node rather than on its neighbours. A detailed description of these disadvantages is given in Section I. Game-theoretic approaches like the SV algorithm take into consideration the marginal contribution of a node to every coalition that it is a part of. This approach has not previously been applied specifically to the RSD problem. For this reason, we aimed to explore its effectiveness for the purpose of RSD. Our results show a good similarity score (Jaccard Index) with the previous studies that used primitive centrality measures.
But as discussed, there were numerous disadvantages with these measures that our SV-based approach tried to overcome. Hence, we observe a slight difference between the most influential nodes found by our approach and those found by the earlier studies conducted on the same datasets.

VI. Conclusion
Sometimes the propagation of rumours on online social networks can lead to serious social problems. It is known to be of great value to accurately identify them from regular comments. Social media rumours have recently become a major concern, especially as people are aware of their ability to influence society. Rumours can not only cause social hysteria in all sorts of crises, but can also cause mass events that are unpredictable and threaten social stability.
In this research work, we introduced a game-theoretic algorithm, based on the Shapley Value, to detect the origin of a rumour in a complex network. We compared the performance of our game-theoretic approach with primitive centrality measures. We also sought to locate the most prominent nodes in order to capture multiple potential rumour sources, rather than concentrating exclusively on a single source. The most influential node identified is assumed to be the rumour source in the network.
To evaluate our algorithm on various real-world scenarios, we examined five different game situations, thereby taking into consideration various approaches to determining the most influential nodes in a given dataset. This helped us to gain a deeper and more holistic understanding of the game-theoretic algorithm. The Jaccard Index has been used as the metric of comparison for our proposed method. The model has shown significant success, as the most prominent nodes are successfully identified for the datasets used.
We are currently working on expanding the theory of Shapley algorithm to consider each person's impact in a social network and thus determine the most serious cause of rumours. We plan to extend the idea of finding the most powerful node in social networks to numerous other similar applications for future work, such as the Internet, or urban networks, and involving a given node in disease dynamics. This will help us understand our algorithm's efficiency and accuracy in multiple applications in the real world.
Further, various optimisation techniques for the SV algorithm, for example Fuzzy Logic, will be implemented for mining much larger social networks and for improving accuracy and other relevant metrics of the project. We expect a fuzzy-based implementation to help address several of the complexities and limitations that we are currently encountering.