Cascade source inference in networks: a Markov chain Monte Carlo approach

Cascades of information, ideas, rumors, and viruses spread through networks. Sometimes, it is desirable to find the source of a cascade given a snapshot of it. In this paper, the source inference problem is tackled under the Independent Cascade (IC) model. First, the #P-completeness of the source inference problem is proven. Then, a Markov chain Monte Carlo algorithm is proposed to find a solution. It is worth noting that our algorithm is designed to handle large networks. In addition, the algorithm does not rely on prior knowledge of when the cascade started. Finally, experiments on a real social network are conducted to evaluate the performance. Under all experimental settings, our algorithm identified the true source with high probability.


Introduction
Modern social and computer networks are common media for cascades of information, ideas, rumors, and viruses. It is often desirable to identify the source of a cascade from a snapshot of the cascade. For example, a good way to stop a rumor is to find the person who fabricated it. Similarly, identifying the first computer infected by a virus provides valuable information for catching the author. Therefore, given the network structure and an observed cascade snapshot consisting only of the set of infected/active nodes, solving the source inference problem is very useful in many cases. Hereafter, we use infected/active and infect/activate interchangeably.
In the seminal works [1] and [2], the source inference problem under the susceptible-infected (SI) model was first studied, and a maximum likelihood estimator was proposed with a theoretical performance bound when the network is a tree. Based on the same model, many works solve this problem with different extensions. With a priori knowledge of a candidate source set, reference [3] infers the source node using a maximum a posteriori estimator. Wang et al. [4] utilize multiple independent epidemic observations to single out their common source. Karamchandani and Franceschetti [5] study the case where infected nodes reveal their infection only with some probability. When multiple sources are involved, algorithms are proposed in [6] and [7] to find all of them. The works mentioned above, except [7], are all based on tree networks, although some of them are applicable to general graphs by constructing breadth-first-search trees. More importantly, all of them use the SI model, where an infected node will certainly infect a susceptible neighbor after a random period of time. Our work, however, is based on the Independent Cascade (IC) model, in which an active node activates its successor with a certain probability determined by the edge weight.
Although the SI model is popular in epidemiological research because it captures the pattern of epidemics, the IC model is arguably more suitable for depicting cascades in social networks, where the relationship between peers plays a more important role than the time of infection. As an example, suppose Alice bought a new hat; her classmates may or may not imitate the purchase depending on how much they agree with her taste. Those who do not appreciate her taste are unlikely to change their minds even if Alice wears her hat every day. These people are now immune to the influence of Alice's new hat, though they may still be persuaded by someone they appreciate more.
Although the IC model is popular in social network research, finding the source under the IC model is rarely studied. Using a model similar to the IC model with identical edge weights, reference [8] studies the problem of inferring both links and sources given multiple observed cascades. Under the IC model, reference [9] solves the problem of finding sources that are expected to generate cascades most similar to the observation. Surprisingly, this problem is fundamentally different from the source inference problem, which finds the source that most likely started the observed cascade. For example, when a cascade that infects all nodes is observed in the simple linear network in Fig. 1, node c is the optimal result for the problem defined in [9] because it is expected to generate a cascade with the least difference from the observed one. However, it is obvious that c cannot be responsible for a cascade that spreads through all three nodes.
In this paper, we work on the problem of detecting the source node that is responsible for a given cascade. We first formulate the source inference problem in the IC model and prove its #P-completeness. Then, a Markov chain Monte Carlo (MCMC) algorithm is proposed to solve the inference problem. It is worth noting that our algorithm scales with the observed cascade size rather than the network size, which is very important given the huge size of today's social networks. Another advantage of our algorithm is that it is designed to deal with snapshots of cascades taken either before or after termination. More importantly, our algorithm does not require prior knowledge of the starting time of the cascade, which is usually unknown in practical scenarios. To evaluate the performance of our algorithm, experiments are conducted on a real network. Experimental results demonstrate the effectiveness of our algorithm.

Propagation model
Fig. 1 Example of a simple case of source inference problem: if all three nodes are found active, then node a must be the source

In this work, we model a social network as a weighted directed graph G(V, E) with a weight w_{i,j} ∈ (0, 1] associated with each edge (i, j) ∈ E, representing the probability of i successfully influencing j. The propagation procedure of a cascade in the network is depicted by the well-known IC model [10]. The cascade starts with all nodes inactive except a source node s, which we assume is activated at time τ_0. At every time step τ > τ_0, every node i that was activated at τ − 1 has a single chance to influence each of its inactive successors through the directed edge, with success probability specified by the weight of the edge. If the influence is successful, then the successor is activated at time τ and will be able to influence its inactive successors at the next time step. The process terminates when no new node is activated.
An important fact about the IC model is that each active node has only one chance to influence each of its neighbors. To put it another way, each edge has only one chance to participate in the propagation, with success rate specified by its weight. Since edge weights are fixed and independent of the cascade, we can flip the biased coins even before the cascade starts to determine whether each edge will help the propagation. This gives an alternative process consisting of two steps that also simulates the IC model. First, a subgraph G′(V, E′) of the original network G is taken by 1) keeping all vertices and 2) filtering edges according to their weights, i.e., keeping each edge independently with probability

Pr((i, j) ∈ E′) = w_{i,j}, (i, j) ∈ E.    (1)

Then, every node i reachable from source s in G′ is active, with its activation time set to τ_0 + d_{G′}(s, i), where d_{G′}(s, i) is the distance, i.e., the number of edges in the directed shortest path, from s to i in G′.
It is easy to verify that the alternative process is equivalent to the previous one. Moreover, the alternative view builds the equivalence between sampling subgraphs of network and simulating cascades on it. Due to this convenience, we extensively use the alternative view in the following sections.
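The equivalence above can be illustrated with a short simulation. The following sketch (the function name is ours, not from the paper) samples a "live-edge" subgraph by flipping each edge's coin up front, then reads activation times off BFS distances from the source, exactly as the alternative process prescribes.

```python
import random
from collections import deque

def sample_live_edge_cascade(weighted_edges, source, seed=None):
    """Simulate one IC cascade via the alternative 'live-edge' view:
    flip every edge's coin up front, keep the successful edges, then set
    each reached node's activation time to its BFS distance from the
    source in the surviving subgraph. Returns {node: activation step}."""
    rng = random.Random(seed)
    # Step 1: keep edge (i, j) independently with probability w[i, j].
    live = {}
    for (i, j), w in weighted_edges.items():
        if rng.random() < w:
            live.setdefault(i, []).append(j)
    # Step 2: BFS from the source; distance = activation time offset.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in live.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

On a path a → b → c with both weights equal to 1, the source a activates b at step 1 and c at step 2, matching the time-stepped process.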

Source inference problem
Suppose in a given network G, an unnoticed cascade starts from an unknown source node s* at time τ_0. Later, at time τ_0 + τ, the cascade is discovered and the set of active nodes A_τ is identified, without knowing their corresponding activation times. Note that A_τ can be viewed as a snapshot of the cascade at time τ. Now, we want to find the node ŝ that most likely had started the cascade. Thus,

ŝ = arg max_{s∈V} Pr(A_τ | G, s, τ),    (2)

where Pr(A_τ | G, s, τ) denotes the probability of a cascade on G starting from s having snapshot A_τ at time τ. According to the alternative view of the IC model defined in the 'Propagation model' section, and supposing G′ is sampled according to (1), we have

Pr(A_τ | G, s, τ) = E_{G′}[I(A_τ = R(G′, s, τ))],    (3)

where R(G′, s, τ) is the set of nodes reachable from s in G′ within distance τ (formally defined in the next section) and I is an indicator function. The following theorem shows the intractability of the source inference problem, i.e., solving (2) given G, τ, and A_τ.

Theorem 1. Source inference problem is #P-complete.
This theorem is proven by constructing a polynomial-time Turing reduction from the s-t connectedness problem [11] to the source inference problem. Please refer to Appendix 1 for the detailed proof.

Basic algorithm
We use R(G′, s, τ) to denote the set of nodes in G′ reachable from s within distance τ, i.e.,

R(G′, s, τ) = {i ∈ V | d_{G′}(s, i) ≤ τ}.

Then, the probability shown in (3) can also be written as

Pr(A_τ | G, s, τ) = Σ_{G′∈S} Pr_G(G′) I(A_τ = R(G′, s, τ)),    (4)

where G represents the distribution of subgraphs of G defined by (1), Pr_G(G′) denotes the probability mass function (PMF) of G′ in distribution G, i.e.,

Pr_G(G′(V, E′)) = ∏_{(i,j)∈E′} w_{i,j} ∏_{(i,j)∈E∖E′} (1 − w_{i,j}),

and I is an indicator function that equals 1 when its argument is true and 0 otherwise.

Because of the #P-completeness of the source inference problem, calculating the exact value of (4) is #P-hard. A trivial method to approximate the value is to estimate the expectation in (3) by randomly sampling graphs from G. But this method is still impractical. To show this, we define S = {G′ | G′ ⊆ G} as the set of all subgraphs of G, which is also the support of G. Then, a subset of S is defined as

S′ = {G′ ∈ S | ∃s ∈ V : s ⇝ A_τ ⊆ G′},

where s ⇝ A_τ ⊆ G′ denotes "every node in A_τ is reachable from s in G′". Now, notice that A_τ = R(G′, s, τ) ⟹ G′ ∈ S′ and that the ratio |S|/|S′| can be exponential in |G|, which means that for almost all subgraphs of G the indicator function in (4) equals 0. As an example, consider a linear graph in which all nodes are observed active: only the subgraphs keeping every edge of the path make the indicator nonzero, an exponentially small fraction of all subgraphs.

To overcome this problem, we want to sample G′ from the set S′ rather than S. On the set S′, we define a new sampling distribution, denoted as G′, whose PMF is

Pr_{G′}(G′) = Pr_G(G′) / Z, where Z = Σ_{G″∈S′} Pr_G(G″).    (7)

Notice that the set S′ is independent of any candidate source node, and so is the normalization factor Z. Therefore, with (7), we have

Pr(A_τ | G, s, τ) = Z Σ_{G′∈S′} Pr_{G′}(G′) I(A_τ = R(G′, s, τ)).

Consequently, we can solve the source inference problem (2) by solving

ŝ = arg max_{s∈V} Σ_{G′∈S′} Pr_{G′}(G′) I(A_τ = R(G′, s, τ)).    (8)

Now the problem is how to sample from S′ with the probability defined in (7). However, one can easily show that calculating the factor Z is #P-hard, which makes calculating (7) impractical. Therefore, it is unlikely to be possible to sample directly from the set S′. Fortunately, the probability ratio between any two subgraphs is easy to compute; thus, we can use the Metropolis algorithm to sample distribution G′ via Markov chain Monte Carlo. Algorithm 1 describes a local move from one subgraph in S′ to another. Each local move adds/removes an edge to/from the previous subgraph G′_k.
The new subgraph G′_{k+1} is either accepted or rejected depending on the probability ratio Pr_{G′}(G′_{k+1})/Pr_{G′}(G′_k) defined by G′. Starting from any subgraph in S′, running Algorithm 1 iteratively produces a Markov chain whose states represent subgraphs in S′ and whose stationary distribution is exactly (7).

Algorithm 1: Local move
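As a concrete illustration, the local move can be sketched as follows (a simplified reading of Algorithm 1, with hypothetical names; the validity test `is_valid`, standing for membership in S′, is passed in as a callback). Because the proposal, toggling a uniformly chosen edge, is symmetric, the Metropolis acceptance ratio reduces to the target ratio w/(1 − w) when adding an edge of weight w and (1 − w)/w when removing it.

```python
import random

def metropolis_local_move(current_edges, all_edges, weights, is_valid, rng):
    """One Metropolis step over subgraphs of G: toggle a uniformly chosen
    edge and accept with probability min(1, ratio), where ratio is the
    target-probability ratio of the proposed subgraph to the current one.
    Proposals that leave the valid set S' (checked by is_valid) have
    target probability 0 and are always rejected."""
    e = rng.choice(all_edges)
    w = weights[e]
    proposal = set(current_edges)
    if e in proposal:                  # propose removing e
        if w >= 1.0:
            return current_edges       # weight-1 edges are always live
        proposal.discard(e)
        ratio = (1.0 - w) / w
    else:                              # propose adding e
        proposal.add(e)
        ratio = w / (1.0 - w) if w < 1.0 else float('inf')
    if not is_valid(proposal):
        return current_edges           # reject: proposal leaves S'
    if rng.random() < min(1.0, ratio):
        return proposal                # accept the move
    return current_edges               # reject by the Metropolis coin
```

Iterating this step from any subgraph in S′ yields a chain whose stationary distribution is (7), since the unknown normalizer Z cancels in the ratio.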
With the help of the local move in Algorithm 1, Algorithm 2 infers the most likely source node responsible for the cascade snapshot A_τ taken at time τ. The input parameter K indicates the number of samples to take. In line 3, the algorithm starts with the whole graph G as the initial sample, which is obviously in S′. During every iteration of the while-loop, a subgraph in S′ is sampled, and all possible source vertices are found and recorded. After the while-loop ends, the node recorded most often is returned. Hence, the returned value of Algorithm 2 is an approximate solution of (8).
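The per-sample bookkeeping inside the while-loop amounts to the following check (a sketch with names of our own choosing): for each sampled subgraph, record every node from which the whole snapshot is reachable within τ hops.

```python
from collections import deque

def candidate_sources(edges, active_set, tau):
    """Return the nodes s in active_set from which every node of the
    snapshot is reachable within tau hops in the sampled subgraph,
    i.e., the reachability part of the condition A_tau = R(G', s, tau)."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, []).append(j)
    result = []
    for s in active_set:
        dist = {s: 0}
        queue = deque([s])
        while queue:                       # BFS truncated at depth tau
            u = queue.popleft()
            if dist[u] == tau:
                continue
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if active_set <= dist.keys():      # all active nodes reached in time
            result.append(s)
    return result
```

Only sources inside A_τ need to be tried (the source is itself active), and only the "everything reached" half of the indicator is tested here; keeping the chain inside S′, and hence the "nothing extra reached" half, is the sampler's job.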

A more practical approach
Algorithm 2: Basic source inference algorithm
Input: instance: G, w, A_τ, τ; parameter: K
Output: ŝ
1 create new array count with size |V| and default value 0; …

Algorithm 2 has some drawbacks in practical scenarios. First, the whole network may be orders of magnitude larger than the cascade snapshot in question. However, Algorithm 2 scales with the size of the full network rather than the snapshot, which is unfavorable here. Second, when the source node of a cascade is unknown, the starting time of the cascade is usually also absent. In these cases, inferring the source node without knowing τ is desired. In this section, we handle these two problems.
Based on the cascade snapshot A_τ, we can classify the edges in E into three disjoint subsets:

E_1 = {(i, j) ∈ E | i ∈ A_τ, j ∈ A_τ},
E_2 = {(i, j) ∈ E | i ∈ A_τ, j ∉ A_τ},
E_3 = {(i, j) ∈ E | i ∉ A_τ}.

And E_2 can be further split into subsets according to the source node of the edges: E_2(i) = {(i, j) ∈ E_2} for each i ∈ A_τ. Then we define three subgraphs of G(V, E) accordingly: G_1(A_τ, E_1), G_2(V, E_2), and G_3(V, E_3). Note that G_1 only contains nodes in A_τ because the edges in G_1 are all between nodes in A_τ. Furthermore, we partition each sampled subgraph G′ into G′_1, G′_2, and G′_3, where G′_k = G′ ∩ G_k. With these definitions, we have the following lemma.

Lemma 1. With the subgraph G_1(A_τ, E_1) consisting of only the edges between nodes in A_τ, the condition (10), i.e., A_τ = R(G′, s, τ), is equivalent to the combination of conditions (11) and (12), which involve only G′_1 and G′_2.

Proof. Eq. (10) can be split into 1) any node in A_τ must be within distance τ from s, i.e.,

∀i ∈ A_τ, d_{G′}(s, i) ≤ τ,    (13)

and 2) any node outside A_τ must have distance from s larger than τ, i.e.,

∀i ∉ A_τ, d_{G′}(s, i) > τ.    (14)

Hence, the shortest path from s to any node i ∈ A_τ lies within G_1, which implies ∀i ∈ A_τ, d_{G′}(s, i) = d_{G′_1}(s, i) and thus (11). Further, (12) means that any node i with d_{G′}(s, i) < τ must not be able to activate its neighbors outside A_τ, which is necessary to ensure (14).
From Lemma 1, it is straightforward to get the following corollary.

Corollary 1. The indicator function in (4) is equivalent to the product of the indicators of conditions (11) and (12), which depend only on G′_1, G′_2, s, and τ.
In addition, because G′ = G′_1 ∪ G′_2 ∪ G′_3 and the edge sets of G_1, G_2, and G_3 are disjoint, the PMF Pr_G(G′) can be rewritten as the product of three terms,

Pr_G(G′) = Pr_{G_1}(G′_1) · Pr_{G_2}(G′_2) · Pr_{G_3}(G′_3),

where each factor is the product of edge probabilities over the corresponding edge set. Now we have Theorem 2, which speeds up the algorithm; its proof is shown in Appendix 2. Theorem 2 shows that sampling subgraphs of G_1, rather than the whole network G, is sufficient to infer the cascade source, which greatly accelerates the algorithm when the whole network is much larger than the cascade snapshot A_τ.
Next, we deal with an unknown cascade starting time, i.e., unknown τ. First, due to the fact that the node set of G_1 is A_τ, the condition s ⇝ A_τ ⊆ G′_1 implies ∀i ∈ A_τ, d_{G′_1}(s, i) ≤ ε_{G′_1}(s), where ε_{G′_1}(s) is the eccentricity of node s in G′_1, defined as

ε_{G′_1}(s) = max_{i∈A_τ} d_{G′_1}(s, i).

As a result, for any given G′_1 and s such that s ⇝ A_τ ⊆ G′_1, there are three possible values for the function f(G′, s, τ) in (18), determined by how τ compares with ε_{G′_1}(s); the value in each of the three cases is independent of the exact τ. Then, we have Theorem 3, which deals with the unknown cascade starting time.

Proof. Because the samples G′_{1,k} are taken from distribution G′_1, we have (22). Substituting (20) into the summation of (22) proves the theorem.
With both Theorems 2 and 3, Algorithm 2 can be improved to Algorithm 3, which overcomes the problems of large networks and unknown τ.
It should be noted that, for any sample G′_1, line 9 in Algorithm 3 can be done in O(|G′_1|) time. First, the condensation C(G′_1) is calculated, which takes linear time. Then, since C(G′_1) is a directed acyclic graph, there is at least one strongly connected component in C(G′_1) that has no predecessor. If there is exactly one such component, it is the candidate set C; if there is more than one, C = ∅. This method also applies to line 8 in Algorithm 1 and line 5 in Algorithm 2.
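The linear-time candidate-set computation described above can be sketched as follows (an iterative Tarjan SCC pass; the function name is ours). The candidate set is the unique strongly connected component with no predecessor in the condensation, if one exists.

```python
def candidate_component(nodes, edges):
    """Candidate source set via condensation: find the strongly connected
    components (iterative Tarjan), then count components with no incoming
    edge from another component. If exactly one such component exists, it
    is the candidate set C; otherwise C is empty."""
    adj = {u: [] for u in nodes}
    for i, j in edges:
        adj[i].append(j)
    index, low, comp = {}, {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def strongconnect(root):
        # Iterative DFS to avoid Python's recursion limit on large graphs.
        work = [(root, 0)]
        while work:
            v, pi = work[-1]
            if pi == 0:
                index[v] = low[v] = counter[0]
                counter[0] += 1
                stack.append(v)
                on_stack.add(v)
            recurse = False
            for i in range(pi, len(adj[v])):
                w = adj[v][i]
                if w not in index:
                    work[-1] = (v, i + 1)   # resume v at edge i+1 later
                    work.append((w, 0))
                    recurse = True
                    break
                if w in on_stack:
                    low[v] = min(low[v], index[w])
            if recurse:
                continue
            if low[v] == index[v]:          # v is the root of an SCC
                scc = []
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    comp[w] = len(sccs)
                    scc.append(w)
                    if w == v:
                        break
                sccs.append(scc)
            work.pop()
            if work:                        # propagate low-link to parent
                low[work[-1][0]] = min(low[work[-1][0]], low[v])

    for v in nodes:
        if v not in index:
            strongconnect(v)
    has_pred = [False] * len(sccs)
    for i, j in edges:
        if comp[i] != comp[j]:
            has_pred[comp[j]] = True
    roots = [k for k in range(len(sccs)) if not has_pred[k]]
    return set(sccs[roots[0]]) if len(roots) == 1 else set()
```

On a directed path a → b → c, only a's component has no predecessor, so C = {a}; if two components lack predecessors, no single node reaches everything and C = ∅.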

Experimental results
In this section, we conduct experiments with our cascade source inference algorithm (Algorithm 3, with K = 10^6) on a real network dataset. The network used is from the WikiVote dataset [12, 13], which consists of all Wikipedia voting data from the inception of Wikipedia till January 2008. The dataset has 7115 nodes and 103,689 directed unweighted edges. Each node represents a Wikipedia user participating in elections, while each directed edge (i, j) means user i voted for user j. We use this unweighted dataset because, despite our best effort, we could not find a social network dataset with influence probabilities available. Since the dataset is unweighted, we use the reciprocal of the in-degree of the destination node as the weight of an edge. With uniformly randomly chosen source nodes, cascades are then generated on the network according to the IC model. To make the experiment challenging, we discard cascades with fewer than 20 candidate sources. Here, the candidate source set is not the active node set A_τ, but the set of nodes from which all active nodes are reachable in G_1, i.e., {i | i ⇝ A_τ ⊆ G_1}. We use 200 cascades in our experiments. Figure 2a, b shows histograms of the number of active nodes and candidate sources among these cascades.
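The weight assignment used here is simple enough to state in code (a sketch; the function name is ours): each edge (i, j) gets weight 1/indeg(j), the common "weighted cascade" convention for unweighted graphs.

```python
def indegree_weights(edges):
    """Assign each directed edge (i, j) the weight 1 / indeg(j), i.e., the
    reciprocal of the in-degree of the destination node."""
    indeg = {}
    for _, j in edges:
        indeg[j] = indeg.get(j, 0) + 1
    return {(i, j): 1.0 / indeg[j] for (i, j) in edges}
```

For instance, if node c has two incoming edges, each gets weight 0.5, so each influence attempt on c succeeds independently with probability 1/2.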
To compare our proposed algorithm with an existing algorithm, we also implement the algorithm proposed in [9]. In that paper, three algorithms ("DP", "Sort", and "OutDegree") are proposed to find a set of k sources. In our case, where a single source generates the cascade, their DP and Sort algorithms are equivalent. In the experiments below, we use this algorithm and call it the "Effector" algorithm.
First, we take snapshots at τ = |A_τ|, i.e., after the cascades terminate, and do the experiment with exact knowledge of τ. Figure 3a shows the distribution of error distances, where the error distance is defined as the distance between the inferred source node and the true source node assuming edges are undirected. For comparison, the error distance of a random guess among A_τ is also shown in Fig. 3a. It is clear that all source nodes inferred by our algorithm are within two hops of the true source node, and 24 % of the inferred nodes are true sources. In comparison, the Effector algorithm has fewer results with an error distance of 0 or 1. To further evaluate the algorithms, we make them output a list of candidate source nodes sorted in descending order of likelihood, rather than merely the most likely source node. This output is sometimes more useful because it answers queries like "what are the 5 most likely sources". Figure 3b shows the distribution of the rank of the true source node in the ordered list. In more than half of the experiments, the true source is among the top 4 candidates output by our algorithm. The Effector algorithm, however, has a much heavier tail, with far fewer results at low ranks. In fact, 15 % of its results have a rank higher than 60, which is not shown in the figure. Figure 3c shows the distribution of relative ranks, i.e., rank divided by candidate set size. Only our algorithm is shown in this figure because the Effector algorithm does not calculate the candidate set, and its output list includes many nodes not in the candidate set, due to the reason explained by Fig. 1 in the 'Introduction' section. In more than 50 % of the experiments, the relative rank of the true source in our output is less than or equal to 0.1.
Then, we do experiments with snapshots taken at τ = 8, when most of the cascades have yet to terminate. The results are shown in Fig. 4. Again, our proposed algorithm performs better than the Effector algorithm. In 55 % of the experiments, our algorithm has the true source node among the top 4 candidates, and in half of the experiments, the true source node has a relative rank no larger than 0.1.
To evaluate the performance of our source inference algorithm when the exact cascade starting time is absent, we conduct another experiment on the snapshot taken at τ = 8 with input time range [0, 16]. As shown in Fig. 5, our algorithm effectively infers the source nodes even without exact knowledge of the cascade starting time. In this experiment, 57 % of the true source nodes are among the top 4 candidates, and in half of the cases, the true source ranked in the top 10 % of the output list.

Conclusion
We considered the cascade source inference problem in the IC model. First, the #P-completeness of this problem was proven. Then, a Markov chain Monte Carlo algorithm was proposed to approximate the solution. Our algorithm was designed with two major advantages: 1) it scales with the observed cascade snapshot rather than the whole network and thus is applicable to enormous modern social networks, and 2) it does not require any knowledge about the starting time of the cascade, which is a common and practical scenario in cascade source inference. To demonstrate the performance of our algorithm, experiments on a real social network were conducted. As shown above, our algorithm performs well regardless of when the cascade snapshot is taken or whether the cascade starting time is known. In all these experiments, around 25 % of the true sources were correctly identified, and about half of the true sources were among the top 4 candidates or the top 10 % of the candidate list.
Proof. In this proof, we use i ⇝ j ⊆ G′ to denote the existence of a path from i to j in graph G′. In addition, i ⇝ V ⊆ G′ means ∀j ∈ V, j ≠ i, i ⇝ j ⊆ G′. According to the algorithm, the output snapshot A_τ contains all vertices, and τ = |V| guarantees that d_{G′}(i, j) < τ if i ⇝ j ⊆ G′. Therefore, due to (3), considering reachability rather than distance is sufficient in the remaining part of the proof.

Now, due to line 4 in Algorithm 4, every node in V̄ is reachable from v in every subgraph G′ sampled via (1). And because w_{t,v} = 1 (by line 3), for any subgraph G′, t ⇝ V̄ ⊆ G′. Thus, property 1 is straightforward. On the other hand, since the new node u has only one incoming edge (v, u), for any i ∈ V̄, i ≠ t, i ⇝ u ⊆ G′ implies i ⇝ t ⊆ G′. Therefore, we have the proof for property 2: for any i ∈ V̄, i ≠ t, Pr(i ⇝ V̄ ⊆ G′) ≤ Pr(i ⇝ t ⊆ G′) < 1, where the last inequality holds because every incoming edge of t has weight 0.5 according to line 5 in Algorithm 4.

To prove property 3, we first note that s is the only successor of u and w_{u,s} = 1; with (23), we have Pr(u ⇝ V̄ ⊆ G′) = Pr(s ⇝ t ⊆ G′). Because Ĝ ⊂ G, sampling subgraphs G′ of G can be viewed as sampling subsets of Ê followed by sampling subsets of E ∖ Ê. Since any path from s to t consists only of edges in Ê, Pr(s ⇝ t ⊆ G′) is fully determined by sampling Ê, or equivalently, by sampling subgraphs of Ĝ. As a result,

Pr(s ⇝ t ⊆ G′) = Connectedness(Ĝ, s, t) · 0.5^{|Ê|},

because every subset of Ê has probability 0.5^{|Ê|} of being selected via (1), according to line 5 in Algorithm 4. Now property 3 follows from (24) and (25).
Proof. First, to show that the source inference problem is in #P, we note that calculating Pr(A_τ | G, i, τ) is in #P, since it is the sum of the probabilities of all subgraphs G′ of G consistent with the snapshot. So the source inference problem, i.e., finding the node i that maximizes Pr(A_τ | G, i, τ), is also in #P.
Since graph Ĝ has 2^{|Ê|} subgraphs, Connectedness(Ĝ, s, t) must be an integer in the range [0, 2^{|Ê|}]. Therefore, Pr(A_τ | G, u, τ) of the output instance of Algorithm 4 must be in the set {k · 0.5^{|Ê|} | k ∈ ℕ, k ≤ 2^{|Ê|}}. A binary search algorithm, i.e., Algorithm 5, can solve the s-t connectedness problem by solving the source inference problem. In Algorithm 5, there are |Ê| iterations of the while-loop. Hence, only a polynomial number of queries to the oracle are made. All other operations can be done in polynomial time. Therefore, this algorithm gives a polynomial-time Turing reduction from the s-t connectedness problem to the source inference problem. Since the s-t connectedness problem is #P-complete and the source inference problem is in #P, Theorem 1 is proven.