Abstract
Given a graph, how to find a small group of ‘gateways’, that is a small subset of nodes that are crucial in connecting the source to the target? For instance, given a social network, who is the best person to introduce you to, say, Chris Ferguson, the poker champion? Or, given a network of people and skills, who is the best person to help you learn about, say, wavelets? We formally formulate this problem in two scenarios: Pair-Gateway and Group-Gateway. For each scenario, we show that it is sub-modular and thus it can be solved near-optimally. We further give fast, scalable algorithms to find such gateways. Extensive experimental evaluations on real data sets demonstrate the effectiveness and efficiency of the proposed methods.
1 Introduction
What is the best gateway between a source node and a target node, in a network? This is a core problem that appears under several guises, with numerous generalizations. Motivating applications include the following:
1. In a corporate social network, which are the key people that bring or hold different groups together? Or, if seeking to establish a cross-division project, who are the best people to lead such an effort?
2. In an immunization setting, given a set of nodes that are infected, and a set of nodes we want to defend, which are the best few ‘gateways’ we should immunize?
3. Similarly, in a network setting, which are the gateway nodes we should best defend against an attack, to maximize connectivity from source to target?
4. Protein pathways: given a protein interaction network, we know that a certain protein group corresponds to an earlier type of flu (e.g., normal flu), and another group corresponds to a new type of flu (e.g., swine flu), and we want to know which other proteins play a critical role in developing normal flu towards swine flu.
5. Given a graph of co-workers and their skills (keywords), whom should you contact to learn more about, say, Linux? You want someone reasonably close to you and fairly well-versed in Linux, but not your secretary or Linus Torvalds himself.
The problem has several natural generalizations: (a) we may be interested in the top k best gateways (in case our first few choices are unavailable); (b) we may have more than one source node and more than one target node, as in the immunization setting above; (c) we may have a bi-partite graph with relationships (edges) between different node types, as in the last example above. Our main contributions in this paper are:
- A novel ‘gateway-ness’ score for a given source and target, that agrees with human intuition, and its generalization to the case where we have a group of nodes as the source and the target;
- Two algorithms to find a set of nodes with the highest ‘gateway-ness’ score, which (1) are fast and scalable; and (2) lead to near-optimal results;
- Extensive experimental results on real data sets, showing the effectiveness and efficiency of the proposed methods.
The rest of the paper is organized as follows: We give the problem definitions in Sect. 2; present ‘gateway-ness’ scores in Sect. 3; and deal with the computational issues in Sect. 4. We evaluate the proposed methods in Sect. 5. Finally, we review the related work in Sect. 6 and conclude in Sect. 7.
2 Problem definitions
Table 1 lists the main symbols we use throughout the paper. In this paper, we focus on directed weighted graphs. We represent the graph by its normalized adjacency matrix (A). Following standard notation, we use capital bold letters for matrices (e.g., A), lower-case bold letters for vectors (e.g., a), and calligraphic fonts for sets (e.g., \({{\mathcal{S}}}\)). We denote the transpose with a prime (i.e., A′ is the transpose of A). We use arrowed lower-case letters for paths on the graph (e.g., p), which are ordered sequences. We use parenthesized superscripts to represent source/target information for the corresponding variables. For example, \({{\bf p}^{(s,t)}=\{s=u_0,u_1,\ldots,u_l=t\}}\) is a path from the source node s to the target node t. If the source/target information is clear from the context, we omit the superscript for brevity. A sink node i on the graph is a node without out-links (i.e., A(:,i) = 0). We use subscripts to denote the corresponding variable after setting the nodes indexed by the subscripts as sinks. For example, \({{\bf p}^{(s,t)}_{\mathcal{I}}}\) is the path from the source node s to the target node t, which does not go through any nodes indexed by the set \(\mathcal{I}\) (i.e., \({u_i\notin {\mathcal{I}},i=0,...,l}\)). With the above notations, our problems can be formally defined as follows:
Problem 1 (Pair-Gateway)
- Given: a weighted directed graph A, a source node s, a target node t, and a budget (integer) k;
- Find: a set of at most k nodes which has the highest ‘gateway-ness’ score wrt s and t.
Problem 2 (Group-Gateway)
- Given: a weighted directed graph A, a group of source nodes \({{\mathcal{S}}, }\) a group of target nodes \({{\mathcal{T}},}\) and a budget (integer) k;
- Find: a set of at most k nodes which has the highest ‘gateway-ness’ score wrt \({{\mathcal{S}}}\) and \({{\mathcal{T}}. }\)
In both Problem 1 (Pair-Gateway) and Problem 2 (Group-Gateway), there are two sub-problems: (1) how to define the ‘gateway-ness’ score of a given subset of nodes \(\mathcal{I}; \) (2) how to find the subset of nodes with the highest ‘gateway-ness’ score. In the next two sections, we present the solutions for each, respectively.
3 Proposed ‘Gateway-ness’ scores
In this section, we present our definitions for ‘Gateway-ness’. We first focus on the case of a single source s and a single target t (Pair-Gateway). We then generalize to the case where both the source and the target are a group of nodes (Group-Gateway).
3.1 Node ‘Gateway-ness’ score
Given a single source s and a single target t, we want to measure the ‘Gateway-ness’ score for a given set of nodes \({{\mathcal{I}}.}\) We first give the formal definitions in such a setting and then provide some intuitions for our definitions.
Formal definitions For a graph A, we can use random walk with restart to measure the proximity (i.e., relevance/closeness) from the source node s to the target node t, which is defined as follows: Consider a random particle that starts from node s. The particle iteratively transits to its neighbors with probability proportional to the corresponding edge weights. Also at each step, the particle returns to node s with some restart probability (1 − c). The proximity score from node s to node t is defined as the steady-state probability r(s, t) that the particle will be on node t (Tong et al. 2008). Intuitively, r(s, t) is the fraction of time that the particle starting from node s will spend on node t of the graph, after an infinite number of steps.
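The random walk with restart described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes a column-normalized matrix with `A[i, j]` holding the transition probability from node `j` to node `i`, and the helper names (`column_normalize`, `rwr_iterative`, `rwr_closed_form`) and the toy graph are hypothetical. It compares the iterative update with the closed form (1 − c)(I − cA)^{-1} applied to the source indicator vector.

```python
import numpy as np

def column_normalize(W):
    """Divide each column by its sum; all-zero columns (sinks) stay zero."""
    col_sums = W.sum(axis=0)
    return W / np.where(col_sums > 0, col_sums, 1.0)

def rwr_iterative(A, s, c=0.95, tol=1e-12, max_iter=10_000):
    """Iterate r <- c*A*r + (1-c)*e_s until successive iterates agree."""
    n = A.shape[0]
    e_s = np.zeros(n)
    e_s[s] = 1.0
    r = e_s.copy()
    for _ in range(max_iter):
        r_next = c * (A @ r) + (1.0 - c) * e_s
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

def rwr_closed_form(A, s, c=0.95):
    """Column s of Q = (1-c)(I - cA)^{-1}; entry t is r(s, t)."""
    n = A.shape[0]
    Q = (1.0 - c) * np.linalg.inv(np.eye(n) - c * A)
    return Q[:, s]

# Hypothetical 4-node undirected toy graph (not from the paper).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A = column_normalize(W)
r_iter = rwr_iterative(A, s=0)
r_exact = rwr_closed_form(A, s=0)
```

With c = 0.95 the iteration contracts slowly (by a factor of at most 0.95 per step), which is one reason pre-computation is attractive when many proximity queries are needed.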
Intuitively, a set of nodes \({{\mathcal{I}}}\) are good gateways wrt s and t if they play an important role in the proximity measure from the source to the target. Therefore, our ‘Gateway-ness’ score can be defined as follows:
\[g(s,t,{\mathcal{I}}) \triangleq r(s,t) - r_{\mathcal{I}}(s,t) \tag{1}\]
where \({r_{\mathcal{I}}(s,t)}\) is the proximity score from source s to t after setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks.
Intuitions Here, we provide some intuition of the ‘Gateway-ness’ score defined by (1), using the running example in Fig. 1.
In Fig. 1, each solid arrowed line is a path from node 1 to node 20, which can be denoted by an ordered sequence. For example, the path marked by the red line can be denoted by \({{\bf p}^{(1,20)}=\{1,3,4,5,12,14,20\}}\). For each path \({{\bf p}^{(s,t)}=\{s=u_0,u_1,\ldots,u_l=t\}}\), we can define its score by (2), where \(\prod\nolimits_{i=1}^{l} {\bf A}(u_i,u_{i-1})\) is the probability that the random particle will traverse this path, and \((1-c)c^l\) penalizes the length of the path. For example, the red path (\({{\bf p}^{(1,20)}=\{1,3,4,5,12,14,20\}}\)) has l = 6 edges and score \((1-c)c^6{\bf A}(3,1){\bf A}(4,3){\bf A}(5,4){\bf A}(12,5){\bf A}(14,12){\bf A}(20,14)\).
\[\hbox{score}({\bf p}^{(s,t)}) \triangleq (1-c)\,c^{l}\prod\nolimits_{i=1}^{l}{\bf A}(u_i,u_{i-1}) \tag{2}\]
where A is the normalized adjacency matrix of the graph.
With the above definitions for the path score, we have the following lemma:
Lemma 1 Sum of Weighted Path Scores
Let P be the set of all the paths from the source node s to the target node t, and Q be the set of all the paths from the source node s to the target node t which go through at least one node indexed by the subset \({{\mathcal{I}}}\). Let r(s, t) be the proximity score defined by random walk with restart and \({g(s,t,{\mathcal{I}})}\) be the ‘Gateway-ness’ score defined by eq. (1). Then we have
\[r(s,t)=\sum\nolimits_{{\bf p}\in P}\hbox{score}({\bf p}),\qquad g(s,t,{\mathcal{I}})=\sum\nolimits_{{\bf p}\in Q}\hbox{score}({\bf p}) \tag{3}\]
Proof
Omitted for brevity. \(\square\)
By (3), the ‘Gateway-ness’ score for a given set of nodes \({{\mathcal{I}}}\) accounts for all the paths from the source node s to the target node t which pass through one or more nodes in \({{\mathcal{I}}}\). For example, given the source node 1 and the target node 20 in Fig. 1, the ‘Gateway-ness’ score for \({{\mathcal{I}}=\{2\}}\) is the sum of the scores of all the paths from node 1 to node 20 that go through node 2 (e.g., the green path, the yellow path, and so on).
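The path-sum view of Lemma 1 can be checked by brute force on a graph small enough to enumerate every path. The sketch below is only an illustration under stated assumptions: the four-node directed acyclic graph, its edge list, and the helper names (`all_paths`, `path_score`) are hypothetical, and a DAG is chosen so that the set of paths is finite. Edge weights are column-normalized, so `A[v, u]` is the probability of stepping from `u` to `v`.

```python
import numpy as np

c = 0.95
# Hypothetical toy DAG, edges given as (from, to); node 3 has no out-links.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n, s, t = 4, 0, 3

W = np.zeros((n, n))
for u, v in edges:
    W[v, u] = 1.0                      # column = origin, row = destination
out_degree = W.sum(axis=0)
A = W / np.where(out_degree > 0, out_degree, 1.0)

def all_paths(u, target, prefix=()):
    """Depth-first enumeration of every path from u to target (finite on a DAG)."""
    prefix = prefix + (u,)
    if u == target:
        yield prefix
        return
    for v in range(n):
        if W[v, u] > 0:
            yield from all_paths(v, target, prefix)

def path_score(path):
    """(1 - c) * c^l * product of A(u_i, u_{i-1}) over the l edges of the path."""
    l = len(path) - 1
    prob = np.prod([A[path[i], path[i - 1]] for i in range(1, l + 1)])
    return (1 - c) * c**l * prob

Q = (1 - c) * np.linalg.inv(np.eye(n) - c * A)
paths = list(all_paths(s, t))
r_from_paths = sum(path_score(p) for p in paths)

# Gateway-ness of I = {1}: paths that visit node 1, versus the sink-based definition.
gateway_nodes = {1}
g_from_paths = sum(path_score(p) for p in paths if gateway_nodes & set(p[1:-1]))
A_sink = A.copy()
A_sink[:, list(gateway_nodes)] = 0.0
Q_sink = (1 - c) * np.linalg.inv(np.eye(n) - c * A_sink)
g_from_sinks = Q[t, s] - Q_sink[t, s]
```

On graphs with cycles the same identity is stated over walks of unbounded length, so explicit enumeration is no longer possible and the matrix form is the practical route.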
3.2 Group ‘Gateway-ness’ score
Here we consider the case where the source and/or target consist of more than one node. Suppose we have a group of source nodes \({{\mathcal{S}}}\) and a group of target nodes \({{\mathcal{T}}. }\) Then, the ‘Gateway-ness’ score for a given set of nodes \(\mathcal{I}\) can be defined in a similar way:
\[g({\mathcal{S,T,I}}) \triangleq \sum\nolimits_{s\in {\mathcal{S}},t\in {\mathcal{T}}}\left(r(s,t) - r_{\mathcal{I}}(s,t)\right) \tag{4}\]
where \({r_{\mathcal{I}}(s,t)}\) is the proximity score from s to t after setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks (i.e., deleting all out-edges, by setting A(:,i) = 0 for all \({i\in{{\mathcal{I}}}}\)).
Intuitively, the score defined by (4) accounts for all the paths from the source group to the target group which go through at least one node in \({{\mathcal{I}}. }\) For example, given \({{\mathcal{S}}=\{1\}}\) and \({{\mathcal{T}}=\{19,20\}}\) in Fig. 1, the group ‘Gateway-ness’ score for \({{\mathcal{I}}=\{5,8\}}\) corresponds to all the paths from node 1 to node 19 or 20 that go through node 5 or node 8 (e.g., red, yellow and green solid lines, purple and blue dashed lines and so on).
4 BASSET: proposed fast solutions
In this section, we address how to quickly find a subset of nodes of the highest ‘Gateway-ness’ score. We start by showing that the straight-forward methods (referred to as ‘Com-RWR’) are computationally intractable. Then, we present the proposed BASSET (BASSET-N for Pair-Gateway and BASSET-G for Group-Gateway). For each case, we first present the algorithm and then analyze its effectiveness as well as its computational complexity.
4.1 Computational challenges
Here, we present the computational challenges and the way we tackle them. For the sake of succinctness, we mainly focus on BASSET-N.
There are two main computational challenges in finding a subset of nodes with the highest ‘Gateway-ness’ score. First, we need to compute the proximity from the source to the target on different graphs, each of which is a perturbed version of the original graph. This essentially means that we cannot directly apply pre-computation-based methods to evaluate the proximity from the source to the target (after setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks). Instead, we have to rely on on-line iterative methods, whose computational complexity is O(m). Second, the challenges are compounded by the need to evaluate \({\hbox{g}(s,t,{\mathcal{I}})}\) (1) or \({\hbox{g}({\mathcal{S,T,I}})}\) (4) an exponential number of times \(({n \choose k })\). Putting these together, the straightforward way to find k nodes with the highest ‘Gateway-ness’ score costs \(O({n \choose k }m), \) which is computationally intractable. Suppose that, on a graph with 1,000,000 nodes, we want to find the best k = 5 gateway nodes. If computing each proximity score takes 0.001 s, then about \(2.64\times 10^{17}\) years are needed to find the gateways. This is much longer than the age of the universe.
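The running-time estimate above follows from a short calculation, reproduced here as a sanity check. The only assumption beyond the figures in the text is a 365-day year.

```python
import math

n, k = 1_000_000, 5
seconds_per_eval = 0.001               # cost of one proximity evaluation, from the text

subsets = math.comb(n, k)              # number of candidate size-k subsets
seconds = subsets * seconds_per_eval
years = seconds / (365 * 24 * 3600)    # assumes a 365-day year

print(f"{subsets:.3e} subsets, about {years:.2e} years")
```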
To tackle such challenges, we resort to two main ideas, which are summarized in Theorem 1. According to Theorem 1, in order to evaluate the ‘Gateway-ness’ score of a given set of nodes, we do not need to actually set these nodes as sinks and compute the proximity score on the new graph. Instead, we can compute it from the original graph. In this way, we can utilize methods based on pre-computation to accelerate the process. Furthermore, since \({\hbox{g}(s,t,{\mathcal{I}})}\) and \({\hbox{g}({\mathcal{S,T,I}})}\) are sub-modular wrt \({{\mathcal{I}}, }\) we can develop a greedy algorithm that avoids exponential enumeration and still obtains a near-optimal solution. In Theorem 1, A is the normalized adjacency matrix of the graph. It is worth pointing out that the proposed methods (BASSET-N and BASSET-G) we will introduce are orthogonal to the specific way of normalization. For simplicity, we use column-normalization throughout this paper. Also, \({{\bf Q}({\mathcal{I}},{\mathcal{I}})}\) is a \({|{\mathcal{I}}|\times |{\mathcal{I}}|}\) matrix, containing the elements of the matrix Q at the rows/columns indexed by \({{\mathcal{I}}. }\) Similarly, \({{\bf Q}(t,{\mathcal{I}})}\) is a row vector of length \({|{\mathcal{I}}|, }\) containing the elements of Q at the t-th row and the columns indexed by \({{\mathcal{I}}}\). \({{\bf Q}({\mathcal{I}},s)}\) is a column vector of length \({|{\mathcal{I}}|, }\) containing the elements of Q at the s-th column and the rows indexed by \({{\mathcal{I}}.}\)
Theorem 1 Core Theorem
Let A be the normalized adjacency matrix of the graph, and \({{\bf Q}=(1-c)({\bf I}-c{\bf A})^{-1}}\). For a given source s and target t, the ‘Gateway-ness’ score of a subset of nodes \({{\mathcal{I}}}\) defined in (1) satisfies the properties P1 and P2. For a given source group \({{\mathcal{S}}}\) and target group \({{\mathcal{T}}, }\) the ‘Gateway-ness’ score of a subset of nodes \({{\mathcal{I}}}\) defined in (4) satisfies the properties P3 and P4, where \({s\neq t, s,t\notin {\mathcal{I}}, {\mathcal{S}}\bigcap {\mathcal{T}}=\emptyset, {\mathcal{S}}\bigcap {\mathcal{I}}={\bf \emptyset}, }\) and \({{\mathcal{T}}\bigcap {\mathcal{I}}=\emptyset}\).
- P1. \({g(s,t,{\mathcal{I}}) = {\bf Q}(t, {\mathcal{I}}){\bf Q}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf Q}({\mathcal{I}},s);}\)
- P2. \({g(s,t,{\mathcal{I}})}\) is sub-modular wrt the set \({{\mathcal{I}}.}\)
- P3. \({g({\mathcal{S,T,I}}) = \sum\nolimits_{s\in {\mathcal{S}},t\in {\mathcal{T}}}{\bf Q}(t, {\mathcal{I}}){\bf Q}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf Q}({\mathcal{I}},s); }\)
- P4. \({g({\mathcal{S,T,I}})}\) is sub-modular wrt the set \({{\mathcal{I}}.}\)
Proof
See the "Appendix". \(\square\)
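Property P1 can be spot-checked numerically. The sketch below is not the proof from the Appendix; it simply compares, on a small random dense graph, the sink-based definition of the score (recomputing the proximity after zeroing the columns of the chosen nodes) with the closed form evaluated on the original graph. The graph, the seed, the node choices, and the helper name `proximity_matrix` are all hypothetical; `Q[t, s]` is read as r(s, t) under column normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 0.95

# Hypothetical random weighted directed graph without self-loops.
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
A = W / W.sum(axis=0)                  # column normalization

def proximity_matrix(A, c):
    """Q = (1 - c)(I - cA)^{-1}."""
    return (1 - c) * np.linalg.inv(np.eye(A.shape[0]) - c * A)

Q = proximity_matrix(A, c)
s, t, I = 0, 7, [2, 4, 5]

# Definition: set the nodes in I as sinks and recompute the proximity.
A_sink = A.copy()
A_sink[:, I] = 0.0
g_direct = Q[t, s] - proximity_matrix(A_sink, c)[t, s]

# P1: the same quantity from the original graph only.
g_closed = Q[t, I] @ np.linalg.solve(Q[np.ix_(I, I)], Q[I, s])
```

Using `np.linalg.solve` instead of forming the inverse of the small block explicitly is a numerical convenience of this sketch, not something prescribed by the theorem.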
Intuition Here, we provide some intuition why \({\hbox{g}(s,t,{\mathcal{I}})}\) and \({\hbox{g}({\mathcal{S,T,I}})}\) are sub-modular. According to Lemma 1, for a given source s and a given target \({t, \hbox{g}(s,t,{\mathcal{I}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{I}})}\) accounts for the scores of all the paths from s to t, which go through some nodes in \({{\mathcal{K}}}\) but none of the nodes in \({{\mathcal{I}}. }\) Therefore, for a given set \({{\mathcal{K}}, }\) if we already have a bigger subset \({{\mathcal{J}}, }\) the additional benefit \({(\hbox{g}(s,t,{\mathcal{J}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{J}}))}\) will be relatively small, compared to the case where we have a smaller subset \({{\mathcal{I}} \, (\hbox{g}(s,t,{\mathcal{I}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{I}}))}\). For example, in Fig. 1, let s = 1, t = 20, and \({{\mathcal{I}}=\{5\}, {\mathcal{J}}=\{2,5\}. }\) Then, if we have a new subset \({{\mathcal{K}}=\{8\}, }\) the additional benefit for subset \({{\mathcal{I}}}\) accounts for all the paths from s = 1 to t = 20 which go through node 8, but not node 5 (e.g., the green path, etc). In contrast, the additional benefit for subset \({{\mathcal{J}}}\) is 0, since all the paths from s = 1 to t = 20 which go through node 8 must also go through some node in \({{\mathcal{J}}}\) (node 2).
4.2 BASSET-N for problem 1
4.2.1 BASSET-N algorithm
Our fast solution for Problem 1 is summarized in Algorithm 1. In Algorithm 1, after initialization (step 1), we first pick a node \(i_0\) with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}\) (step 3). Then, in steps 4–14, we find the rest of the nodes in a greedy way. That is, in each outer loop, we try to find one more node while keeping the current \({{\mathcal{I}}}\) unchanged. According to P1 of Theorem 1, v(i) computed in step 7 is the gateway score for the subset \({{\mathcal{J}}. }\) If the current subset of nodes \({{\mathcal{I}}}\) can completely disconnect the source and the target (by setting them as sinks), we will stop the algorithm (step 12). Therefore, Algorithm 1 always returns no more than k nodes. It is worth pointing out that in Algorithm 1, all the proximity scores are computed from the original graph A. Therefore, we can utilize methods based on pre-computation to accelerate the whole process. To name a few, for a medium-size graph A (e.g., a few thousand nodes), we can pre-compute and store the matrix \({{\bf Q}=(1-c)({\bf I}-c{\bf A})^{-1}}\); for large unipartite graphs and bipartite graphs, we can use the NB_LIN and BB_LIN algorithms, respectively (Tong et al. 2008).
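The greedy procedure described above can be sketched as follows. This is a simplified illustration rather than a transcription of Algorithm 1: it assumes a dense, pre-computed Q (the medium-size-graph case), re-evaluates P1 from scratch for every candidate instead of reusing intermediate results, and uses hypothetical names (`gateway_score`, `greedy_gateways`, `tol`). With an empty current set, the first pick maximizes Q(t,i)Q(i,s)/Q(i,i), which matches the initial step in the text.

```python
import numpy as np

def gateway_score(Q, s, t, nodes):
    """P1 of Theorem 1: Q(t,I) Q(I,I)^{-1} Q(I,s), evaluated on the original graph."""
    idx = list(nodes)
    if not idx:
        return 0.0
    return float(Q[t, idx] @ np.linalg.solve(Q[np.ix_(idx, idx)], Q[idx, s]))

def greedy_gateways(Q, s, t, k, tol=1e-12):
    """Add, one at a time, the node with the largest score for the enlarged set."""
    n = Q.shape[0]
    chosen = []
    candidates = [j for j in range(n) if j not in (s, t)]
    while len(chosen) < k and candidates:
        scores = [gateway_score(Q, s, t, chosen + [j]) for j in candidates]
        best = int(np.argmax(scores))
        chosen.append(candidates.pop(best))
        # Stop early once the chosen nodes fully disconnect s from t,
        # i.e. the score has reached the whole proximity r(s, t) = Q[t, s].
        if Q[t, s] - scores[best] <= tol:
            break
    return chosen

# Hypothetical demo: an undirected path 0-1-2-3-4, where any interior node
# separates the two end points, so a single gateway suffices.
c, n = 0.95, 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
A = W / W.sum(axis=0)
Q = (1 - c) * np.linalg.inv(np.eye(n) - c * A)
picked = greedy_gateways(Q, s=0, t=4, k=3)
```

Because every score is recomputed from Q, this sketch is meant for reading, not for speed; it makes no attempt to match the complexity analysis given for BASSET-N.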
4.2.2 Analysis of BASSET-N
In this subsection, we analyze the effectiveness and the efficiency of Algorithm 1. First, the effectiveness of the proposed BASSET-N is guaranteed by the following lemma. According to Lemma 2, although BASSET-N is a greedy algorithm, the results it outputs are near-optimal.
Lemma 2 Effectiveness of BASSET-N
Let \({{\mathcal{I}}}\) be the subset of nodes selected by Algorithm 1 and \({|{\mathcal{I}}|=k_0. }\) Then, \({g(s,t,{\mathcal{I}})\ge (1-1/e) \max_{|{\mathcal{J}}|=k_0} g(s,t,{\mathcal{J}}), }\) where e is the base of the natural logarithm, and \({g(s,t,{\mathcal{I}})}\) and \({g(s,t,{\mathcal{J}})}\) are defined by (1).
Proof
It is easy to verify that the node \(i_0\) selected in step 10 of Algorithm 1 satisfies \({i_0 = \hbox{argmax}_{j\notin{\mathcal{I}},j\neq s,j\neq t}\hbox{g}(s,t,{\mathcal{I}}\cup \{j\}). }\) Also, we have \({\hbox{g}(s,t,\emptyset)=0,}\) where \(\emptyset\) is the empty set. On the other hand, according to Theorem 1, \({\hbox{g}(s,t,{\mathcal{I}})}\) is sub-modular wrt the subset \({{\mathcal{I}}. }\) Therefore (Nemhauser et al. 1978), we have \({\hbox{g}(s,t,{\mathcal{I}})\ge (1-1/e)\hbox{max}_{|{\mathcal{J}}|=k_0}\,\hbox{g}(s,t,{\mathcal{J}}),}\) which completes the proof. \(\square\)
Next, we analyze the efficiency of BASSET-N, which is given in Lemma 3. We can draw the following two conclusions from Lemma 3: (1) the proposed BASSET-N achieves a significant speedup over the straightforward method (\(O(n\cdot k^4)\) vs. \(O({n \choose k }m)\)). For example, on a graph with 100 nodes and 1,000 edges, in order to find the gateway with k = 5 nodes, BASSET-N is more than 6 orders of magnitude faster, and the speedup quickly increases wrt the size of the graph; (2) the proposed BASSET-N is applicable to large graphs since it is linear wrt the number of nodes.
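The speedup quoted in this example can be reproduced by plugging the numbers into the two bounds. This is a rough check only: it treats the asymptotic bounds as exact operation counts and ignores constant factors, which is an assumption of the sketch.

```python
import math

n, m, k = 100, 1_000, 5

brute_force_ops = math.comb(n, k) * m  # O(C(n, k) * m) for the straightforward method
basset_ops = n * k**4                  # O(n * k^4) bound for BASSET-N

speedup = brute_force_ops / basset_ops
orders_of_magnitude = math.log10(speedup)
print(f"speedup ~ {speedup:.3g} ({orders_of_magnitude:.2f} orders of magnitude)")
```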
Lemma 3 Efficiency of BASSET-N
The computational complexity of Algorithm 1 is upper bounded by \(O(n\cdot k^4).\)
Proof
The cost for steps 1–2 is constant. The cost for step 3 is O(n). At each inner loop (steps 6–7), the cost is \(O(nj^3+nj^2)\). The cost for steps 9–13 is O(n). The outer loop has no more than k − 1 iterations. Putting these together, the computational cost for BASSET-N is:
\[O(1)+O(n)+\sum\nolimits_{j=1}^{k-1}O(nj^3+nj^2+n)\le O(n\cdot k^4)\]
which completes the proof. \(\square\)
4.3 BASSET-G for problem 2
4.3.1 BASSET-G algorithm
Our fast solution for Problem 2 is summarized in Algorithm 2. It works in a similar way to Algorithm 1: after initialization (step 1), we first pick a node \(i_0\) with the highest \({\sum\nolimits_{s\in{\mathcal{S}},\,t\in{\mathcal{T}}}\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}}\) (step 3). Then, in steps 4–14, we find the rest of the nodes in a greedy way. That is, in each outer loop, we try to find one more node while keeping the current \({{\mathcal{I}}}\) unchanged. If the current subset of the nodes \({{\mathcal{I}}}\) can completely disconnect the source group and the target group (by setting them as sinks), we will stop the algorithm (step 10). As in Algorithm 1, all the proximity scores are computed from the original graph A. Therefore, we can again utilize pre-computation-based methods to accelerate the whole process.
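For the group setting, the score that the greedy loop maximizes is the double sum in P3, which can be evaluated with a single block solve. The sketch below is an illustration under assumptions: the function name `group_gateway_score`, the random test graph, and the chosen groups are hypothetical, and Q is assumed dense and pre-computed. It also cross-checks the result against the sink-based definition (4).

```python
import numpy as np

def group_gateway_score(Q, sources, targets, nodes):
    """P3: sum over s in S, t in T of Q(t,I) Q(I,I)^{-1} Q(I,s)."""
    S, T, I = list(sources), list(targets), list(nodes)
    if not I:
        return 0.0
    middle = np.linalg.solve(Q[np.ix_(I, I)], Q[np.ix_(I, S)])   # |I| x |S|
    return float((Q[np.ix_(T, I)] @ middle).sum())               # sum of a |T| x |S| block

def proximity_matrix(A, c):
    """Q = (1 - c)(I - cA)^{-1}."""
    return (1 - c) * np.linalg.inv(np.eye(A.shape[0]) - c * A)

# Hypothetical random graph and groups, only for the cross-check.
rng = np.random.default_rng(1)
n, c = 8, 0.95
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
A = W / W.sum(axis=0)
Q = proximity_matrix(A, c)

S, T, I = [0, 1], [6, 7], [3, 4]
g_closed = group_gateway_score(Q, S, T, I)

# Definition (4): drop in proximity after turning the nodes in I into sinks.
A_sink = A.copy()
A_sink[:, I] = 0.0
Q_sink = proximity_matrix(A_sink, c)
g_direct = sum(Q[t, s] - Q_sink[t, s] for s in S for t in T)
```

Swapping this function in for a pair score inside a greedy loop gives a group variant of the selection procedure; the loop itself is unchanged.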
4.3.2 Analysis of BASSET-G
The effectiveness and efficiency of the proposed BASSET-G are given in Lemma 4 and Lemma 5, respectively. Like BASSET-N, the proposed BASSET-G is (1) near-optimal; and (2) fast and scalable for large graphs.
Lemma 4 Effectiveness of BASSET-G
Let \({{\mathcal{I}}}\) be the subset of nodes selected by Algorithm 2 and \({|{\mathcal{I}}|=k_0. }\) Then, \({g({\mathcal{S,T,I}})\ge (1-1/e)\max_{|{\mathcal{J}}|=k_0}\,g({\mathcal{S,T,J}}), }\) where \({g({\mathcal{S,T,I}})}\) and \({g({\mathcal{S,T,J}})}\) are defined by (4).
Proof
Similar to the proof of Lemma 2. Omitted for brevity. \(\square\)
Lemma 5 Efficiency of BASSET-G
The computational complexity of Algorithm 2 is upper bounded by \({O(n\cdot(\max(k,|{\mathcal{S}}|,|{\mathcal{T}}|))^4).}\)
Proof
Similar to the proof of Lemma 3. Omitted for brevity. \(\square\)
5 Experimental evaluations
In this section we present experimental results. All the experiments are designed to answer the following questions:
1. Effectiveness: how effective are the proposed ‘Gateway-ness’ scores in real graphs?
2. Efficiency: how fast and scalable are the proposed BASSET-N and BASSET-G?
5.1 Experimental setup
Data sets We used six real data sets, which are summarized in Table 2.
The first data set (Karate) is an un-weighted unipartite graph, which describes friendship among the 34 members of a karate club at a US university (Zachary 1977). Each node is a member in the karate club and the existence of the edge indicates that the two corresponding members are friends. Overall, we have n = 34 nodes and m = 156 edges.
The second data set (PolBooks) is a co-purchasing book network. Each node is a political book and there is an edge between two books if they were purchased by the same person. Overall, we have n = 105 nodes and m = 882 edges.
The third data set (MovieLens) is from the GroupLens project. It contains 100,000 ratings from 943 users on 1,682 movies. Each user has rated at least 20 movies from 1 (strongly unsatisfactory) to 5 (strongly satisfactory). We use this data set to construct a user-movie bipartite graph. We connect a user with a particular movie if s/he has given a positive rating (4 or 5) to this movie. As a result, users who give only negative ratings and movies which receive only negative ratings are neglected. On the whole, there are 2,399 nodes (942 users, 1,447 movies) and 55,375 edges.
The fourth data set (AC) and the fifth data set (AA) are both from DBLP. The fourth data set (AC) is an un-weighted bipartite graph. We have two types of nodes: author and conference. The existence of an edge indicates that the corresponding author has published in the corresponding conference. Overall, we have 421,807 nodes and m = 2,667,199 edges.
The fifth data set (AA) is a co-authorship network, where each node is an author and the edge weight is the number of co-authored papers between the two corresponding persons. Overall, we have n = 418,236 nodes and m = 2,753,798 edges.
The last data set (NetFlix) is from the Netflix prize. Rows represent users and columns represent movies. If a user has given a particular movie a positive rating (4 or 5), we connect them with an edge. In total, we have 2,667,199 nodes (2,649,429 users and 17,770 movies), and 56,919,190 edges.
Parameter settings and machine configurations There is one parameter in BASSET-N and BASSET-G, the probability c for random walk with restart. We set c = 0.95, as suggested in (Tong et al. 2008). For the computational cost, we report the wall-clock time. All the experiments ran on the same machine with four 2.4GHz AMD CPUs and 48GB memory, running Linux (2.6 kernel). For each experiment, we run it 10 times and report the average.
5.2 Effectiveness
Here, we evaluate the effectiveness of the proposed ‘Gateway-ness’ scores. We first compare with several candidate methods in terms of separating the source from the target. We then present various case studies.
5.2.1 Quantitative comparisons
The basic idea of the proposed ‘Gateway-ness’ scores is to find a subset of nodes which collectively play an important role in measuring the proximity from the source node (or source group) to the target node (or target group). Here, we want to validate this basic assumption. We compare it with the following alternative choices: (a) selecting k nodes with the highest center-piece AND score (CePS-AND) (Tong and Faloutsos 2006); (b) selecting k nodes with the highest center-piece OR score (CePS-OR) (Tong and Faloutsos 2006); (c) randomly selecting k nodes (Rand); (d) randomly selecting k nodes from the neighboring nodes of the source node and the target node (Neighbor-Rand); (e) selecting k nodes with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}\) (Topk-Ind). We randomly select a source node s and a target node t, and then use the different methods to select a subset \({{\mathcal{I}}}\) with k nodes. Figure 2 presents the comparison results, where the x-axis is the number of nodes selected (k), and the y-axis is the normalized decay in terms of the proximity score from the source node s to the target node \({t\,(\frac{{\bf r}(s,t)-{\bf r}_{{\mathcal{I}}}(s,t)}{{\bf r}(s,t)})}\). The resulting curves are averaged over 1,000 randomly chosen source-target pairs. From Fig. 2, we can see that (1) the proposed BASSET-N performs best in terms of separating the source from the target; (2) Topk-Ind, where we simply select the k nodes with the highest individual \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}, \) does not perform as well as BASSET-N, where we want to find a subset of k nodes which collectively has the highest score \({ {\bf r}({\mathcal{I}},t)'{\bf r}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf r}(s,{\mathcal{I}})'.}\)
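The normalized decay used as the y-axis of this comparison can be computed for any selected subset, whichever method produced it. The sketch below is an illustration only: the helper names (`rwr_proximity`, `normalized_decay`) and the six-node ring are hypothetical, and the experimental protocol of the paper (1,000 random pairs, the listed baselines) is not reproduced.

```python
import numpy as np

def rwr_proximity(A, s, t, c=0.95):
    """r(s, t): entry t of the solution of (I - cA) r = (1 - c) e_s."""
    n = A.shape[0]
    e_s = np.zeros(n)
    e_s[s] = 1.0
    r = np.linalg.solve(np.eye(n) - c * A, (1 - c) * e_s)
    return r[t]

def normalized_decay(A, s, t, nodes, c=0.95):
    """(r(s,t) - r_I(s,t)) / r(s,t), with the nodes in I turned into sinks."""
    idx = np.asarray(list(nodes), dtype=int)
    A_sink = A.copy()
    A_sink[:, idx] = 0.0               # delete the out-edges of the selected nodes
    base = rwr_proximity(A, s, t, c)
    return (base - rwr_proximity(A_sink, s, t, c)) / base

# Hypothetical demo: a ring of 6 nodes with source 0 and target 3.
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
A = W / W.sum(axis=0)

decay_none = normalized_decay(A, 0, 3, [])
decay_one_side = normalized_decay(A, 0, 3, [1])      # blocks one of the two routes
decay_both = normalized_decay(A, 0, 3, [1, 5])       # blocks both routes
```

A decay of 1 means the selected nodes fully separate the source from the target; values between 0 and 1 measure partial separation.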
5.2.2 Case studies
Next, we will show some case studies, to demonstrate the effectiveness of BASSET-N and BASSET-G.
Karate We start with the Karate graph, which is widely used in social network analysis. In Fig. 3, there are two different communities in the graph (shaded). In each community, there are some ‘hub’ nodes (e.g., nodes 33 and 34 in the left community; and nodes 1 and 4 in the right community). The two communities are connected by some ‘bridging nodes’ (e.g., nodes 3, 10, 19, 20). Table 3 presents the resulting gateways of BASSET-N with the budget k = 5 for a few source-target pairs. The results are consistent with human intuition. The gateways either are the local center of the community that the source/target node belongs to, or are bridging nodes that connect the two communities when the source node and the target node belong to different communities. For example, if s = 1 and t = 33, the resulting nodes 3, 10, 11 are bridging nodes, while node 34 is the local center for the left community. Note that we always return at most k = 5 nodes. For example, if s = 15 and t = 34, we only output one node (node 1) as the gateway. This is because all the paths from node 15 to node 34 must go through node 1.
PolBooks For this data set, the nodes are political books and the existence of an edge indicates the co-purchasing (by the same person) of the two books. Each book is annotated by one of the following three labels: ‘liberal’, ‘conservative’ and ‘neutral’. We pick a ‘liberal’ book (‘The Price of Loyalty’) as the source node, and a ‘conservative’ book (‘Losing Bin Laden’) as the target node. Then, we ran the proposed BASSET-N to find the gateway with 10 nodes. The result is presented in Table 4. The result is again consistent with human intuition: the resulting gateway books are either popular books in one of the two communities (‘conservative’ vs. ‘liberal’) such as ‘Bush country’ from ‘conservative’, ‘Back up suck up’ from ‘liberal’, etc; or those ‘neutral’ books which are likely to be purchased by readers from both communities (e.g., ‘Sleeping with the devil’, etc).
MovieLens For this data set, we pick a user who likes comedy but not horror movies (i.e., she has given positive ratings to some comedy movies, but has given negative ratings to some horror movies, or has never rated any horror movies) as the source node. We also pick a user who likes horror but not comedy movies as the target node. Then, we ran BASSET-N to find the gateway with k = 10 nodes. Table 5 lists the resulting gateways for two such source-target pairs. We can see that the resulting movies are either popular comedies (e.g., “Shakespeare in Love”) or horror/thriller movies (e.g., “Silence of the Lambs”). Interestingly, “Ghostbusters”, whose genre is both comedy and horror (therefore might be liked by both the source user and the target user), shows up in Table 5a.
AC This is a bipartite graph. Given a source conference/author and a target conference/author, we can run BASSET-N to find either the gateway conferences or the gateway authors (Table 6). Table 7 gives one such example when the source is ‘VLDB’ and the target is ‘NIPS’. Conceptually, we treat an \(n_1 \times n_2\) bipartite graph as an \((n_1+n_2)\times(n_1+n_2)\) unipartite graph, and we further restrict the search to the desired node type. Again, we can see that the results make sense. The resulting gateway authors are either productive in one of the two fields, databases vs. statistics (e.g., Prof. Michael I. Jordan in statistics, Prof. Hector Garcia-Molina in databases, etc); or productive in data mining (e.g., Dr. Rakesh Agrawal, Prof. Jiawei Han), which is an intersection field between statistics and databases. We have similar observations for the resulting gateway conferences. For example, ‘SIGMOD’ and ‘UAI’ are isomorphic (i.e., have very similar neighbor sets) to ‘VLDB’ and ‘NIPS’, respectively; and ‘KDD’ is one major conference in data mining, which is a highly plausible major connection from ‘VLDB’ (databases) to ‘NIPS’ (statistics / machine learning).
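The conversion from a bipartite graph to a unipartite one mentioned above amounts to placing the bipartite matrix and its transpose in the off-diagonal blocks of a larger square matrix. The sketch below assumes an undirected reading of the author-conference relation; the function name, the toy matrix `B`, and the index layout (authors first, conferences second) are hypothetical choices for illustration.

```python
import numpy as np

def bipartite_to_unipartite(B):
    """Embed an n1 x n2 bipartite matrix into an (n1+n2) x (n1+n2) symmetric matrix."""
    n1, n2 = B.shape
    return np.block([[np.zeros((n1, n1)), B],
                     [B.T, np.zeros((n2, n2))]])

# Hypothetical toy data: 3 authors (rows) and 2 conferences (columns).
B = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
W = bipartite_to_unipartite(B)
n1, n2 = B.shape

# Restricting the search to one node type is then a matter of candidate indices.
author_candidates = list(range(n1))                 # nodes 0 .. n1-1
conference_candidates = list(range(n1, n1 + n2))    # nodes n1 .. n1+n2-1
```

For graphs of the size of AC, a dense block matrix like this would be impractical; a sparse representation would be the natural substitute, which this sketch does not attempt.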
AA We use this data set to perform case studies for the proposed BASSET-G. We choose (1) a group of people from a certain field (e.g., ‘text’, ‘theory’, etc) as the source group \({{\mathcal{S}}; }\) and (2) another group of people in some other field (e.g., ‘databases’, ‘bioinformatics’, etc) as the target group \({{\mathcal{T}}. }\) Then, we ran the proposed BASSET-G to find the gateway with k = 10 nodes. Table 8 lists some results. They are all consistent with human intuition: the resulting authors are either productive authors in one of the two fields, or multi-disciplinary authors who have close collaborations with both the source and the target groups of authors.
5.3 Efficiency
We will study the wall-clock running time of the proposed BASSET-N and BASSET-G here. Basically, we want to answer the following two questions:
1. (Speed) What is the speedup of the proposed BASSET-N and BASSET-G over the straightforward methods?
2. (Scalability) How do BASSET-N and BASSET-G scale with the size of the graph (n and m)?
First, we compare BASSET-N and BASSET-G with two straightforward methods: (1) ‘Com-RWR’, where we use combinatorial enumeration to find the gateway and, for each enumeration, we compute the proximity from the new graph; and (2) ‘Com-Eval’, where we use combinatorial enumeration to find the gateway and, for each enumeration, we compute the proximity from the original graph. Figure 4 shows the comparison on two real data sets. We can draw the following conclusions. (1) Straightforward methods (‘Com-RWR’ and ‘Com-Eval’) are computationally intractable even for a small graph. For example, on the Karate data set with only 34 nodes, it takes more than 20,560 seconds and 100,000 seconds to find the k = 10 gateway by ‘Com-Eval’ and by ‘Com-RWR’, respectively. (2) The speedup of the proposed BASSET-N and BASSET-G over both ‘Com-Eval’ and ‘Com-RWR’ is significant: in most cases, we achieve speedups of several (up to 6) orders of magnitude. (3) The speedup of the proposed BASSET-N and BASSET-G over both ‘Com-RWR’ and ‘Com-Eval’ quickly increases wrt the size of the gateway k. Note that we stop running the program if it takes more than 100,000 seconds (i.e., longer than a day).
Next, we evaluate the scalability of the proposed BASSET-N and BASSET-G wrt the size of the graph, using the largest data set (NetFlix). From Fig. 5, we can draw the following conclusions: (1) if we fix the number of nodes (n) in the graph, the wall-clock time of both BASSET-N and BASSET-G is almost constant wrt the number of edges (m); and (2) if we fix the number of edges (m) in the graph, the wall-clock time of both BASSET-N and BASSET-G is linear wrt the number of nodes (n). Therefore, they are suitable for large graphs.
6 Related work
In this section, we review the related work, which can be categorized into four parts:
Betweenness centrality The proposed ‘Gateway-ness’ scores relate to measures of betweenness centrality, both those based on the shortest path (Freeman 1977) and those based on random walks (Newman 2005). When the gateway set size is k = 1, the proposed ‘Gateway-ness’ scores can be viewed as query-specific betweenness centrality measures. Moreover, in the proposed BASSET-N and BASSET-G, we aim to find a subset of nodes collectively, whereas in traditional betweenness centrality, we usually calculate the score for each node independently (and then might pick the k nodes with the highest individual scores).
Connection subgraphs In the proposed BASSET-N, the idea of finding a subset of nodes wrt the source/target is also related to the concept of connection subgraphs (Faloutsos et al. 2004; Koren et al. 2006; Tong and Faloutsos 2006). However, in connection subgraphs, we aim to find a subset of nodes which have strong connections among themselves for the purpose of visualization, whereas in the proposed BASSET-N, we implicitly encourage the resulting subset of nodes to be disconnected from each other so that they are able to collectively disconnect the target node from the source node to the largest extent (if we set them as sinks). It is interesting to notice that, if we want to find the gateway with k = 1 for BASSET-N, the score can be viewed as a normalized, directed version of the CePS-AND score (Tong and Faloutsos 2006; see Notes). Moreover, we allow the more general case where the source/target is a group of nodes in the proposed BASSET-G; in connection subgraphs, however, the source/target is always a single node.
Graph proximity The basic idea of the proposed BASSET-N and BASSET-G is to find a subset of nodes which will bring the largest decrease of the proximity score from the source node (or the source group) to the target node (or the target group). Graph proximity itself is an important building block in many graph mining settings. Representative work includes the BANKS system (Aditya et al. 2002), link prediction (Liben-Nowell and Kleinberg 2003), content-based image retrieval (He et al. 2004), cross-modal correlation discovery (Pan et al. 2004), pattern matching (Tong et al. 2007), ObjectRank (Balmin et al. 2004), RelationalRank (Geerts et al. 2004), etc.
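As a rough illustration of this idea, the following sketch computes random-walk-with-restart proximity on a toy graph and measures how much the source-to-target score drops when a single node is turned into a sink. The toy graph, the restart parameter c = 0.9, the fixed-point iteration, and the function name `rwr` are all assumptions made for this example; it is not the BASSET implementation.

```python
def rwr(adj, source, c=0.9, sinks=frozenset(), iters=500):
    """Random walk with restart from `source` by fixed-point iteration.

    adj maps each node to its list of out-neighbors. Nodes in `sinks`
    keep their in-links but lose their out-links, so walks die there.
    Returns a dict: node -> proximity score r(source, node).
    """
    r = {v: 0.0 for v in adj}
    r[source] = 1.0 - c
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[source] = 1.0 - c              # restart mass
        for u, nbrs in adj.items():
            if u in sinks or not nbrs:
                continue                   # no out-links: mass is absorbed
            share = c * r[u] / len(nbrs)   # column-normalized transition
            for v in nbrs:
                nxt[v] += share
        r = nxt
    return r


# Toy undirected graph (assumed): node 3 bridges {0, 1, 2} and {4, 5, 6}.
adj = {
    0: [1, 2], 1: [0, 3], 2: [0, 3],
    3: [1, 2, 4, 5],
    4: [3, 6], 5: [3, 6], 6: [4, 5],
}
s, t = 0, 6
base = rwr(adj, s)[t]
# Drop in proximity from s to t when node i alone is set as a sink.
drop = {i: base - rwr(adj, s, sinks={i})[t] for i in adj if i not in (s, t)}
best = max(drop, key=drop.get)
print(best)  # 3
```

In this toy graph every path from node 0 to node 6 passes through node 3, so setting it as a sink removes the entire proximity score, while any other single node leaves an alternative route open.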
Other related work in graph mining In recent years, graph mining has been a very active research topic. Representative work includes pattern and law mining (Albert et al. 1999; Broder et al. 2000), frequent substructure discovery (Jin et al. 2005; Xin et al. 2005), influence propagation (Kempe et al. 2003), fraud and anomaly detection (Neville et al. 2005; Noble and Cook 2003), recommendation (Agarwal and Merugu 2007; Cheng et al. 2007), community mining and graph partition (Backstrom et al. 2006; Chen et al. 2009; Gibson et al. 1998; Girvan and Newman 2002; Karypis and Kumar 1999; Qian et al. 2009), near-clique detection (Pei et al. 2005), etc.
7 Conclusion
In this paper, we study how to find good ‘gateway’ nodes in a graph, given one or more source and target nodes. Our main contributions are: (a) we formulate the problem precisely; and (b) we develop BASSET-N and BASSET-G, two fast (up to 6 orders of magnitude of speedup) and scalable (linear wrt the number of nodes in the graph) algorithms that solve it in a provably near-optimal fashion, using sub-modularity. We applied the proposed BASSET-N and BASSET-G to real data sets to validate their effectiveness and efficiency.
Notes
A path from the source group to the target group is a path which starts from a node of the source group and ends at a node of the target group.
According to Wikipedia (http://en.wikipedia.org/wiki/Age_of_the_universe), the age of the universe is about \(1.4 \times 10^{10}\) years.
This is because in random walk with restart, we have r(i, j) = Q(j, i) for any i, j (Tong et al. 2008).
Here, we assume that the cost to get one proximity score is constant, which can be achieved with pre-computation methods (Tong et al. 2008).
The result when source and target are a group of nodes is similar, and omitted for brevity.
To see this, notice that in the case k = 1, in BASSET-N, we want to find the node with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}; \) while in CePS-AND (Tong and Faloutsos 2006), it picks the nodes with the highest r(s, i)r(t, i), where i = 1, ..., n and i ≠ s, i ≠ t.
References
Aditya, B., Bhalotia, G., Chakrabarti, S., Hulgeri, A., Nakhe, C., & Parag, S. S. (2002). Banks: Browsing and keyword searching in relational databases. In VLDB (pp. 1083–1086).
Agarwal, D., & Merugu, S. (2007). Predictive discrete latent factor models for large scale dyadic data. In KDD (pp. 26–35).
Albert, R., Jeong, H., & Barabasi, A.-L. (1999). Diameter of the world wide web. Nature, 401, 130–131.
Backstrom, L., Huttenlocher, D. P., Kleinberg, J. M., & Lan, X. (2006). Group formation in large social networks: membership, growth, and evolution. In KDD (pp. 44–54).
Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). Objectrank: Authority-based keyword search in databases. In VLDB (pp. 564–575).
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., & Wiener, J. (2000). Graph structure in the web: experiments and models. In WWW conference.
Chen, J., Zaïane, O. R., & Goebel, R. (2009). Detecting communities in social networks using max-min modularity. In SDM (pp. 978–989).
Cheng, H., Tan, P.-N., Sticklen, J., & Punch, W. F. (2007). Recommendation via query centered random walk on k-partite graph. In ICDM (pp. 457–462).
Faloutsos, C., McCurley, K. S., & Tomkins, A. (2004). Fast discovery of connection subgraphs. In KDD (pp. 118–127).
Freeman, L. C. (1977). A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.
Geerts, F., Mannila, H., & Terzi, E. (2004). Relational link-based ranking. In VLDB (pp. 552–563).
Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring web communities from link topology. In Ninth ACM conference on hypertext and hypermedia (pp. 225–234), New York.
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99(12), 7821–7826.
He, J., Li, M., Zhang, H.-J., Tong, H., & Zhang, C. (2004). Manifold-ranking based image retrieval. In ACM Multimedia (pp. 9–16).
Jin, R., Wang, C., Polshakov, D., Parthasarathy, S., & Agrawal, G. (2005). Discovering frequent topological structures from graph datasets. In KDD (pp. 606–611).
Karypis, G., & Kumar, V. (1999). Multilevel k-way hypergraph partitioning. In DAC (pp. 343–348).
Kempe, D., Kleinberg, J., & Tardos, E. (2003). Maximizing the spread of influence through a social network. In KDD.
Koren, Y., North, S. C., & Volinsky, C. (2006). Measuring and extracting proximity in networks. In KDD (pp. 245–255).
Krause, A., & Guestrin, C. (2005). Near-optimal nonmyopic value of information in graphical models. In UAI (pp. 324–331).
Liben-Nowell, D., & Kleinberg, J. (2003). The link prediction problem for social networks. In Proceedings of CIKM.
Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14, 265–294.
Neville, J., Simsek, Ö., Jensen, D., Komoroske, J., Palmer, K., & Goldberg, H. G. (2005). Using relational knowledge discovery to prevent securities fraud. In KDD (pp. 449–458).
Newman, M. (2005). A measure of betweenness centrality based on random walks. Social Networks, 27, 39–54.
Noble, C. C., & Cook, D. J. (2003). Graph-based anomaly detection. In KDD (pp. 631–636).
Pan, J.-Y., Yang, H.-J., Faloutsos, C., & Duygulu, P. (2004). Automatic multimedia cross-modal correlation discovery. In KDD (pp. 653–658).
Pei, J., Jiang, D., & Zhang, A. (2005). On mining cross-graph quasi-cliques. In KDD (pp. 228–238).
Piegorsch, W., & Casella, G. E. (1990). Inverting a sum of matrices. SIAM Review, 32, 470.
Qian, T., Srivastava, J., Peng, Z., & Sheu, P. C.-Y. (2009). Simultaneously finding fundamental articles and new topics using a community tracking method. In PAKDD (pp. 796–803).
Tong, H., & Faloutsos, C. (2006). Center-piece subgraphs: problem definition and fast solutions. In KDD (pp. 404–413).
Tong, H., Faloutsos, C., Gallagher, B., & Eliassi-Rad, T. (2007). Fast best-effort pattern matching in large attributed graphs. In KDD (pp. 737–746).
Tong, H., Faloutsos, C., & Pan, J.-Y. (2008). Random walk with restart: Fast solutions and applications. Knowledge and Information Systems: An International Journal (KAIS).
Xin, D., Han, J., Yan, X., & Cheng, H. (2005). Mining compressed frequent-pattern sets. In VLDB (pp. 709–720).
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33, 452–473.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grants No. IIS1017415, IIS 0905215, and DBI-0960443. Research was sponsored by the Defense Threat Reduction Agency under contract No. HDTRA1-10-1-0120 and by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. It is continuing through participation in the Anomaly Detection at Multiple Scales (ADAMS) program sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) under Agreements No. W911NF-11-C-0200 and W911NF-11-C-0088. This work is also partially supported by an IBM Faculty Award and Google Mobile 2014 Program. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.
Appendix
A Proofs of the core theorem
Here, we provide the detailed proof of Theorem 1 for completeness.
Proof of P1
WLOG, we assume that \({{\mathcal{I}}=\{n-k+1,...n\}}\). Let A and \(\tilde{{\bf A}}\) be the normalized adjacency matrices of the graph before/after we set the subset of nodes in \({{\mathcal{I}}}\) as sinks. Write A and \(\tilde{{\bf A}}\) in block form:
where 0 is a matrix with all zero elements.
Let \(\tilde{{\bf Q}}=(1-c)({\bf I}-c\tilde{{\bf A}})^{-1}. \) We can also write \(\tilde{{\bf Q}}\) and Q in block form:
Applying the block matrix inverse lemma (Piegorsch and Casella 1990) to \(\tilde{{\bf Q}}\) and Q, we get the following equations:
Therefore, we have
On the other hand, based on the properties of random walk with restart (Tong et al. 2008), we have r(i, j) = Q(j, i), and \({{\bf r}_{{\mathcal{I}}}(i,j)=\tilde{{\bf Q}}(j,i), (i,j=1,...,n). }\) Together with (8), we have
which completes the proof of P1. \(\square\)
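The block-matrix step of this argument can be sketched as follows. This is one way to fill in the algebra, not necessarily the exact form used in the proof: it assumes that A is column-normalized (so that turning the last k nodes into sinks zeroes their columns), that the subscript 1 indexes the first n − k nodes and 2 indexes \({{\mathcal{I}}}\), that s and t are not in \({{\mathcal{I}}}\), and that \({\hbox{g}(s,t,{\mathcal{I}})}\) denotes the drop \({{\bf r}(s,t)-{\bf r}_{{\mathcal{I}}}(s,t)}\).

```latex
\begin{align*}
% Block forms before/after the nodes in I become sinks (assumed layout)
\mathbf{A} &= \begin{pmatrix}\mathbf{A}_{11} & \mathbf{A}_{12}\\ \mathbf{A}_{21} & \mathbf{A}_{22}\end{pmatrix},
\qquad
\tilde{\mathbf{A}} = \begin{pmatrix}\mathbf{A}_{11} & \mathbf{0}\\ \mathbf{A}_{21} & \mathbf{0}\end{pmatrix}\\
% I - c\tilde{A} is block lower-triangular, so its (1,1) block inverts directly
\tilde{\mathbf{Q}}_{11} &= (1-c)\,(\mathbf{I}-c\mathbf{A}_{11})^{-1}\\
% Q^{-1} = (I - cA)/(1-c); its (1,1) block is the inverse of a Schur complement of Q
\bigl(\mathbf{Q}^{-1}\bigr)_{11} &= \tfrac{1}{1-c}\,(\mathbf{I}-c\mathbf{A}_{11})
 = \bigl(\mathbf{Q}_{11}-\mathbf{Q}_{12}\mathbf{Q}_{22}^{-1}\mathbf{Q}_{21}\bigr)^{-1}\\
% Combining the two lines above
\tilde{\mathbf{Q}}_{11} &= \mathbf{Q}_{11}-\mathbf{Q}_{12}\mathbf{Q}_{22}^{-1}\mathbf{Q}_{21}\\
% Translating with r(i,j) = Q(j,i) and r_I(i,j) = \tilde{Q}(j,i)
\mathbf{r}(s,t)-\mathbf{r}_{\mathcal{I}}(s,t)
 &= \mathbf{r}(s,\mathcal{I})\,\mathbf{r}(\mathcal{I},\mathcal{I})^{-1}\,\mathbf{r}(\mathcal{I},t)
\end{align*}
```

For a single sink node i (k = 1), the last line reduces to the ratio \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}\) that appears in the Notes.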
Proof of P3
Since P1 holds, we have
which completes the proof of P3. \(\square\)
Proof of P2
Let \({{\mathcal{I,J,K}}}\) be three subsets and \({{\mathcal{I}}\subseteq {\mathcal{J}}. }\) We will first prove by induction that, for any integer power j, the following inequality holds element-wise.
It is easy to verify that the base case (i.e., j = 1) of (11) holds. Next, assume that (11) holds for j = 1, ..., \(j_0\); we want to prove that it also holds for the case j = \(j_0 + 1\):
In (12), the first inequality holds because of the induction assumption. The second inequality holds because \({{\bf A}_{{\mathcal{I}}}\ge {\bf A}_{{\mathcal{J}}}\ge 0}\) holds element-wise, and \({{\bf A}_{{\mathcal{I}}\cup {\mathcal{K}}}\ge {\bf A}_{{\mathcal{J}}\cup {\mathcal{K}}}\ge 0}\) holds element-wise.
Since \(\tilde{{\bf Q}}=(1-c)({\bf I}-c\tilde{{\bf A}})^{-1}=(1-c)\sum_{j=0}^\infty(c\tilde{{\bf A}})^j, \) we have
Therefore, \({\hbox{g}(s,t,{\mathcal{I}})}\) is sub-modular, which completes the proof of P2. \(\square\)
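As a numerical sanity check of P2, and not a substitute for the proof, the sketch below brute-forces the diminishing-returns inequality on a small toy graph. It assumes that \({\hbox{g}(s,t,{\mathcal{I}})}\) is the drop in random-walk-with-restart proximity when the nodes in \({{\mathcal{I}}}\) lose their out-links; the graph, c = 0.9, the iteration count, the tolerance, and all function names are illustrative choices.

```python
from itertools import combinations


def proximity(adj, s, t, sinks=(), c=0.9, iters=400):
    """RWR score from s to t after the nodes in `sinks` lose their
    out-links, computed by plain fixed-point iteration."""
    r = {v: 0.0 for v in adj}
    r[s] = 1.0 - c
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[s] = 1.0 - c
        for u, nbrs in adj.items():
            if u in sinks or not nbrs:
                continue
            for v in nbrs:
                nxt[v] += c * r[u] / len(nbrs)
        r = nxt
    return r[t]


def check_diminishing_returns(adj, s, t, tol=1e-9):
    """True if g(I+x) - g(I) >= g(J+x) - g(J) for all I <= J, x not in J."""
    cand = [v for v in adj if v not in (s, t)]
    subsets = [frozenset(p) for k in range(len(cand) + 1)
               for p in combinations(cand, k)]
    base = proximity(adj, s, t)
    # g[S] = drop in proximity when the nodes in S are set as sinks.
    g = {S: base - proximity(adj, s, t, sinks=S) for S in subsets}
    for J in subsets:
        for I in subsets:
            if not I <= J:
                continue
            for x in cand:
                if x in J:
                    continue
                if g[I | {x}] - g[I] < g[J | {x}] - g[J] - tol:
                    return False
    return True


# Toy undirected graph (assumed), source 0 and target 5.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4],
       3: [1, 4, 5], 4: [2, 3, 5], 5: [3, 4]}
print(check_diminishing_returns(adj, 0, 5))  # True
```

The check enumerates all subsets of the candidate nodes, so it is only feasible for very small graphs.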
Proof of P4
Since \({\hbox{g}({\mathcal{S,T,I}})=\sum_{s\in {\mathcal{S}},t\in {\mathcal{T}}}\hbox{g}(s,t,{\mathcal{I}})}\) (in other words, \({\hbox{g}({\mathcal{S,T,I}})}\) is a non-negative linear combination of sub-modular functions), and since sub-modularity is preserved under non-negative linear combinations (Krause and Guestrin 2005), \({\hbox{g}({\mathcal{S,T,I}})}\) is also sub-modular, which completes the proof of P4. \(\square\)
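The linearity step can be written out in one chain of (in)equalities. The sketch below takes the diminishing-returns form of sub-modularity as its definition and uses an element x outside \({{\mathcal{J}}}\); the symbol x and the layout are choices made here for illustration.

```latex
% For any I \subseteq J and any x \notin J:
\begin{align*}
\mathrm{g}(\mathcal{S},\mathcal{T},\mathcal{I}\cup\{x\})-\mathrm{g}(\mathcal{S},\mathcal{T},\mathcal{I})
 &= \sum_{s\in\mathcal{S},\,t\in\mathcal{T}}
    \bigl[\mathrm{g}(s,t,\mathcal{I}\cup\{x\})-\mathrm{g}(s,t,\mathcal{I})\bigr]\\
 &\ge \sum_{s\in\mathcal{S},\,t\in\mathcal{T}}
    \bigl[\mathrm{g}(s,t,\mathcal{J}\cup\{x\})-\mathrm{g}(s,t,\mathcal{J})\bigr]
    && \text{(P2, applied to each pair } (s,t)\text{)}\\
 &= \mathrm{g}(\mathcal{S},\mathcal{T},\mathcal{J}\cup\{x\})-\mathrm{g}(\mathcal{S},\mathcal{T},\mathcal{J})
\end{align*}
```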
Cite this article
Tong, H., Papadimitriou, S., Faloutsos, C. et al. Gateway finder in large graphs: problem definitions and fast solutions. Inf Retrieval 15, 391–411 (2012). https://doi.org/10.1007/s10791-012-9190-3