1 Introduction

What is the best gateway between a source node and a target node, in a network? This is a core problem that appears under several guises, with numerous generalizations. Motivating applications include the following:

  1. In a corporate social network, which are the key people that bring or hold different groups together? Or, if seeking to establish a cross-division project, who are the best people to lead such an effort?

  2. In an immunization setting, given a set of nodes that are infected, and a set of nodes we want to defend, which are the best few ‘gateways’ we should immunize?

  3. Similarly, in a network setting, which are the gateway nodes we should best defend against an attack, to maximize connectivity from source to target?

  4. Protein pathways: given a protein interaction network, we know that a certain protein group corresponds to an earlier type of flu (e.g., normal flu), and another group corresponds to a new type of flu (e.g., swine flu), and we want to know which other proteins play a critical role in developing normal flu towards swine flu.

  5. Given a graph of co-workers and their skills (keywords), whom should you contact to learn more about, say, Linux? You want someone reasonably close to you and fairly well-versed in Linux, but not your secretary or Linus Torvalds himself.

The problem has several natural generalizations: (a) we may be interested in the top k best gateways (in case our first few choices are unavailable); (b) we may have more than one source node and more than one target node, as in the immunization setting above; (c) we may have a bipartite graph with relationships (edges) between different node types, as in the last example above. Our main contributions in this paper are:

  • A novel ‘gateway-ness’ score for a given source and target that agrees with human intuition, together with its generalization to the case where the source and the target are groups of nodes;

  • Two algorithms to find a set of nodes with the highest ‘gateway-ness’ score, which (1) are fast and scalable; and (2) lead to near-optimal results;

  • Extensive experimental results on real data sets, showing the effectiveness and efficiency of the proposed methods.

The rest of the paper is organized as follows: We give the problem definitions in Sect. 2; present ‘gateway-ness’ scores in Sect. 3; and deal with the computational issues in Sect. 4. We evaluate the proposed methods in Sect. 5. Finally, we review the related work in Sect. 6 and conclude in Sect. 7.

2 Problem definitions

Table 1 lists the main symbols we use throughout the paper. In this paper, we focus on directed weighted graphs. We represent the graph by its normalized adjacency matrix (A). Following standard notation, we use capital bold letters for matrices (e.g., A), lower-case bold letters for vectors (e.g., a), and calligraphic fonts for sets (e.g., \({{\mathcal{S}}}\)). We denote the transpose with a prime (i.e., A′ is the transpose of A). We use arrowed lower-case letters for paths on the graph (e.g., p), which are ordered sequences. We use parenthesized superscripts to represent source/target information for the corresponding variables. For example, \({\bf p}^{(s,t)} = \{s = u_0, u_1, \ldots, u_l = t\}\) is a path from the source node s to the target node t. If the source/target information is clear from the context, we omit the superscript for brevity. A sink node i on the graph is a node without out-links (i.e., A(:,i) = 0). We use subscripts to denote the corresponding variable after setting the nodes indexed by the subscripts as sinks. For example, \({{\bf p}^{(s,t)}_{\mathcal{I}}}\) is the path from the source node s to the target node t, which does not go through any nodes indexed by the set \(\mathcal{I}\) (i.e., \({u_i\notin {\mathcal{I}},i=0,...,l}\)). With the above notations, our problems can be formally defined as follows:

Table 1 Symbols

Problem 1(Pair-Gateway)

  1. Given:

    a weighted directed graph A, a source node s, a target node t, and a budget (integer) k;

  2. Find:

    a set of at most k nodes which has the highest ‘gateway-ness’ score wrt s and t.

Problem 2(Group-Gateway)

  1. Given:

    a weighted directed graph A, a group of source nodes \({{\mathcal{S}}, }\) a group of target nodes \({{\mathcal{T}},}\) and a budget (integer) k;

  2. Find:

    a set of at most k nodes which has the highest ‘gateway-ness’ score wrt \({{\mathcal{S}}}\) and \({{\mathcal{T}}. }\)

In both Problem 1 (Pair-Gateway) and Problem 2 (Group-Gateway), there are two sub-problems: (1) how to define the ‘gateway-ness’ score of a given subset of nodes \(\mathcal{I}; \) (2) how to find the subset of nodes with the highest ‘gateway-ness’ score. In the next two sections, we present the solutions for each, respectively.

3 Proposed ‘Gateway-ness’ scores

In this section, we present our definitions for ‘Gateway-ness’. We first focus on the case of a single source s and a single target t (Pair-Gateway). We then generalize to the case where both the source and the target are a group of nodes (Group-Gateway).

3.1 Node ‘Gateway-ness’ score

Given a single source s and a single target t, we want to measure the ‘Gateway-ness’ score for a given set of nodes \({{\mathcal{I}}.}\) We first give the formal definitions in such a setting and then provide some intuitions for our definitions.

Formal definitions For a graph A, we can use random walk with restart to measure the proximity (i.e., relevance/closeness) from the source node s to the target node t, which is defined as follows: Consider a random particle that starts from node s. The particle iteratively transits to its neighbors with probability proportional to the corresponding edge weights. Also at each step, the particle returns to node s with some restart probability (1 − c). The proximity score from node s to node t is defined as the steady-state probability \({\bf r}(s,t)\) that the particle will be on node t (Tong et al. 2008). Intuitively, \({\bf r}(s,t)\) is the fraction of time that the particle starting from node s will spend on node t of the graph, after an infinite number of steps.
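As an illustrative sketch of this definition (not the implementation used in the paper), the steady-state scores can be approximated by repeatedly applying the update r ← cAr + (1 − c)e_s, where e_s is the indicator vector of the source. The function name `rwr`, the dictionary-based graph representation, the fixed iteration count, and the 4-node toy graph are all assumptions made for illustration.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s.

    out_edges maps each node to {neighbor: edge weight}; every node must
    appear as a key. Returns {node: approximate steady-state probability}.
    """
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c                      # restart with probability 1 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())        # normalize the out-going weights
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


# Hypothetical toy graph: 1 -> {2, 3} -> 4 -> 1
graph = {1: {2: 1.0, 3: 1.0}, 2: {4: 1.0}, 3: {4: 1.0}, 4: {1: 1.0}}
scores = rwr(graph, s=1)
```

On this toy graph every node has out-links, so the scores sum to one, and nodes 2 and 3 receive equal scores by symmetry. A fixed iteration count is used only to keep the sketch short; a convergence test on the change between iterations would be the more usual choice.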

Intuitively, a set of nodes \({{\mathcal{I}}}\) are good gateways wrt s and t if they play an important role in the proximity measure from the source to the target. Therefore, our ‘Gateway-ness’ score can be defined as follows:

$$ \hbox{g}(s,t,{{\mathcal{I}}}) \triangleq \Updelta {\bf r}(s,t)\triangleq{\bf r}(s,t) - {\bf r}_{{\mathcal{I}}}(s,t) $$
(1)

where \({r_{\mathcal{I}}(s,t)}\) is the proximity score from source s to t after setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks.
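Definition (1) can be evaluated directly, if slowly, by computing the proximity twice: once on the original graph and once after removing the out-links of the nodes in \({{\mathcal{I}}}\). The sketch below does exactly that. The helper names `rwr` and `gateway_score`, the graph representation, and the diamond-shaped toy graph are assumptions for illustration only.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


def gateway_score(out_edges, s, t, gateways, c=0.95):
    """g(s, t, I) = r(s, t) - r_I(s, t), following definition (1)."""
    # Turn every node in `gateways` into a sink by dropping its out-links;
    # the transition probabilities of all other nodes are left unchanged.
    sunk = {u: ({} if u in gateways else nbrs) for u, nbrs in out_edges.items()}
    return rwr(out_edges, s, c)[t] - rwr(sunk, s, c)[t]


# Hypothetical diamond graph: 1 -> {2, 3} -> 4
diamond = {1: {2: 1.0, 3: 1.0}, 2: {4: 1.0}, 3: {4: 1.0}, 4: {}}
g_one = gateway_score(diamond, 1, 4, {2})      # blocks one of two routes
g_both = gateway_score(diamond, 1, 4, {2, 3})  # disconnects 1 from 4
```

In this toy graph, blocking node 2 removes half of the proximity from node 1 to node 4, and blocking both middle nodes removes all of it.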

Intuitions Here, we provide some intuition of the ‘Gateway-ness’ score defined by (1), using the running example in Fig. 1.

Fig. 1 Running example (best viewed in color). (Color figure online)

In Fig. 1, each solid arrowed line is a path from node 1 to node 20, which can be denoted by an ordered sequence. For example, the path marked by the red line can be denoted by \({\bf p}^{(1,20)} = \{1, 3, 4, 5, 12, 14, 20\}\). For each path \({\bf p}^{(s,t)} = \{s = u_0, u_1, \ldots, u_l = t\}\), we can define its score by (2), where \(\prod\nolimits_{i=1}^l {\bf A}(u_i,u_{i-1})\) is the probability that the random particle will traverse this path, and \((1-c)c^l\) penalizes the length of the path. For example, the red path (\({\bf p}^{(1,20)} = \{1, 3, 4, 5, 12, 14, 20\}\)) has score \((1-c)c^6{\bf A}(3,1){\bf A}(4,3){\bf A}(5,4){\bf A}(12,5){\bf A}(14,12){\bf A}(20,14)\).

$$ \hbox{score}({\bf p}^{(s,t)}) \triangleq (1-c)c^l\prod_{i=1}^l {\bf A}(u_i,u_{i-1}) $$
(2)

where A is the normalized adjacency matrix of the graph.

With the above definitions for the path score, we have the following lemma:

Lemma 1 Sum of Weighted Path Scores

Let P be the set of all the paths from the source node s to the target node t, and Q be the set of all the paths from the source node s to the target node t which go through at least one node indexed by the subset \({{\mathcal{I}}}\). Let \({\bf r}(s,t)\) be the proximity score defined by random walk with restart and \({g(s,t,{\mathcal{I}})}\) be the ‘Gateway-ness’ score defined by eq. (1). Then we have

$$ {\bf r}(s,t) = \sum_{{\bf p}^{(s,t)}\in {\bf P}} score({{\bf p}}^{(s,t)}); \quad g(s,t,{{\mathcal{I}}}) = \sum_{{\bf p}^{(s,t)}\in {\bf Q}} score({{\bf p}}^{(s,t)}) $$
(3)

Proof

Omitted for brevity. \(\square\)

By (3), the ‘Gateway-ness’ score for a given set of nodes \({{\mathcal{I}}}\) accounts for all the paths from the source node s to the target node t which pass through one or more nodes in \({{\mathcal{I}}}\). For example, given the source node 1 and the target node 20 in Fig. 1, the ‘Gateway-ness’ score for \({{\mathcal{I}}=\{2\}}\) is the sum of the scores of all the paths from node 1 to node 20 that go through node 2 (e.g., the green path, the yellow path, and so on).
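The path-sum view of Lemma 1 can be checked numerically on a small example. The sketch below enumerates every path in a hypothetical acyclic toy graph (so that the enumeration terminates), scores each path as the length penalty \((1-c)c^l\) times the product of transition probabilities, and compares the sums with the random-walk quantities. All names and the graph itself are assumptions for illustration; this is a sanity check, not a proof.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


def path_scores(out_edges, s, t, c=0.95):
    """Return [(path, score)] for all s-to-t paths; assumes an acyclic graph."""
    found, stack = [], [([s], 1.0)]
    while stack:
        path, prob = stack.pop()
        u = path[-1]
        if u == t:
            # (1 - c) * c^l * product of transition probabilities along the path
            found.append((path, (1.0 - c) * c ** (len(path) - 1) * prob))
        total = sum(out_edges[u].values())
        for v, w in out_edges[u].items():
            stack.append((path + [v], prob * w / total))
    return found


# Hypothetical acyclic graph with three paths from node 1 to node 6
dag = {1: {2: 2.0, 3: 1.0}, 2: {4: 1.0, 5: 1.0}, 3: {5: 1.0},
       4: {6: 1.0}, 5: {6: 1.0}, 6: {}}
s, t, gateways, c = 1, 6, {2}, 0.95

paths = path_scores(dag, s, t, c)
sum_all = sum(score for _, score in paths)
sum_through = sum(score for path, score in paths if gateways & set(path))

sunk = {u: ({} if u in gateways else nbrs) for u, nbrs in dag.items()}
r_st = rwr(dag, s, c)[t]
g_def = r_st - rwr(sunk, s, c)[t]
```

Here `sum_all` should agree with `r_st`, and `sum_through` with `g_def`, up to floating-point error.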

3.2 Group ‘Gateway-ness’ score

Here we consider the case where the source and/or target consist of more than one node. Suppose we have a group of source nodes \({{\mathcal{S}}}\) and a group of target nodes \({{\mathcal{T}}. }\) Then, the ‘Gateway-ness’ score for a given set of nodes \(\mathcal{I}\) can be defined in a similar way:

$$ \hbox{g}({{\mathcal{S,T,I}}}) \triangleq \sum_{s\in {{\mathcal{S}}}, t\in {{\mathcal{T}}}}\Updelta {\bf r}(s,t)\triangleq\sum_{s\in {{\mathcal{S}}}, t\in {{\mathcal{T}}}}({\bf r}(s,t) - {\bf r}_{{\mathcal{I}}}(s,t)) $$
(4)

where \({r_{\mathcal{I}}(s,t)}\) is the proximity score from s to t by setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks (i.e., delete all out-edges, by setting A(:,i) = 0 for all \({i\in{{\mathcal{I}}}}\)).

Intuitively, the score defined by (4) accounts for all the paths from the source group to the target group (footnote 1) which go through at least one node in \({{\mathcal{I}}. }\) For example, given \({{\mathcal{S}}=\{1\}}\) and \({{\mathcal{T}}=\{19,20\}}\) in Fig. 1, the group ‘Gateway-ness’ score for \({{\mathcal{I}}=\{5,8\}}\) corresponds to all the paths from node 1 to node 19 or node 20 that go through node 5 or node 8 (e.g., red, yellow and green solid lines, purple and blue dashed lines and so on).
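Definition (4) is a plain sum of pairwise scores, which the following sketch makes explicit. It evaluates each term by definition (two random walks per source), so it is meant only to illustrate the quantity being defined; the function names and the small toy graph are hypothetical.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


def group_gateway_score(out_edges, sources, targets, gateways, c=0.95):
    """g(S, T, I): sum of r(s, t) - r_I(s, t) over all s in S and t in T."""
    sunk = {u: ({} if u in gateways else nbrs) for u, nbrs in out_edges.items()}
    score = 0.0
    for s in sources:
        before = rwr(out_edges, s, c)   # proximities on the original graph
        after = rwr(sunk, s, c)         # proximities with gateways as sinks
        score += sum(before[t] - after[t] for t in targets)
    return score


# Hypothetical graph: sources 1 and 5 both reach targets 3 and 4 via node 2
graph = {1: {2: 1.0}, 5: {2: 1.0}, 2: {3: 1.0, 4: 1.0}, 3: {}, 4: {}}
g_group = group_gateway_score(graph, sources={1, 5}, targets={3, 4}, gateways={2})
```

Since node 2 lies on every source-to-target path in this toy graph, the group score equals the total proximity from the two sources to the two targets.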

4 BASSET: proposed fast solutions

In this section, we address how to quickly find a subset of nodes of the highest ‘Gateway-ness’ score. We start by showing that the straight-forward methods (referred to as ‘Com-RWR’) are computationally intractable. Then, we present the proposed BASSET (BASSET-N for Pair-Gateway and BASSET-G for Group-Gateway). For each case, we first present the algorithm and then analyze its effectiveness as well as its computational complexity.

4.1 Computational challenges

Here, we present the computational challenges and the way we tackle them. For the sake of succinctness, we mainly focus on BASSET-N.

There are two main computational challenges in order to find a subset of nodes with the highest ‘Gateway-ness’ score. First of all, we need to compute the proximity from the source to the target on different graphs, each of which is a perturbed version of the original graph. This essentially means that we cannot directly apply some powerful pre-computational method to evaluate the proximity from the source to the target (after setting the subset of nodes indexed by \({{\mathcal{I}}}\) as sinks). Instead, we have to rely on on-line iterative methods, whose computational complexity is O(m). The challenges are compounded by the need to evaluate \({\hbox{g}(s,t,{\mathcal{I}})}\) (1) or \({\hbox{g}({\mathcal{S,T,I}})}\) (4) an exponential number of times \(({n \choose k })\). Putting these together, the straightforward way to find k nodes with the highest ‘Gateway-ness’ score costs \(O({n \choose k }m). \) This is computationally intractable. Suppose that, on a graph with 1,000,000 nodes, we want to find the best k = 5 gateway nodes. If computing each proximity score takes 0.001 s, then \(2.64 \times 10^{17}\) years are needed to find the gateways. This is much longer than the age of the universe (footnote 2).
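The running-time estimate above can be reproduced with a few lines of arithmetic. The sketch assumes one proximity evaluation per candidate subset and a 365-day year; the variable names are illustrative.

```python
import math

n, k = 1_000_000, 5
seconds_per_eval = 0.001                 # assumed cost of one proximity score

subsets = math.comb(n, k)                # number of candidate subsets of size k
seconds = subsets * seconds_per_eval
years = seconds / (365 * 24 * 3600)      # assuming a 365-day year
```

With these assumptions `years` comes out at roughly 2.64 × 10^17, matching the figure quoted in the text.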

To tackle such challenges, we resort to two main ideas, which are summarized in Theorem 1. According to Theorem 1, in order to evaluate the ‘Gateway-ness’ score of a given set of nodes, we do not need to actually set these nodes as sinks and compute the proximity score on the new graph. Instead, we can compute it from the original graph. In this way, we can utilize methods based on pre-computation to accelerate the process. Furthermore, since \({\hbox{g}(s,t,{\mathcal{I}})}\) and \({\hbox{g}({\mathcal{S,T,I}})}\) are sub-modular wrt \({{\mathcal{I}}, }\) we can develop a greedy algorithm to avoid exponential enumeration, and still get a near-optimal solution. In Theorem 1, A is the normalized adjacency matrix of the graph. It is worth pointing out that the proposed methods (BASSET-N and BASSET-G) we will introduce are orthogonal to the specific way of normalization. For simplicity, we use column-normalization throughout this paper. Also, \({{\bf Q}({\mathcal{I}},{\mathcal{I}})}\) is a \({|{\mathcal{I}}|\times |{\mathcal{I}}|}\) matrix, containing the elements in the matrix Q which are at the rows/columns indexed by \({{\mathcal{I}}. }\) Similarly, \({{\bf Q}(t,{\mathcal{I}})}\) is a row vector with length \({|{\mathcal{I}}|, }\) containing the elements in the matrix Q which are at the t-th row and the columns indexed by \({{\mathcal{I}}}\). \({{\bf Q}({\mathcal{I}},s)}\) is a column vector with length \({|{\mathcal{I}}|, }\) containing the elements in the matrix Q which are at the s-th column and the rows indexed by \({{\mathcal{I}}.}\)

Theorem 1 Core Theorem

Let A be the normalized adjacency matrix of the graph, and \({\bf Q} = (1-c)({\bf I} - c{\bf A})^{-1}\). For a given source s and target t, the ‘Gateway-ness’ score of a subset of nodes \({{\mathcal{I}}}\) defined in (1) satisfies the properties P1 and P2. For a given source group \({{\mathcal{S}}}\) and target group \({{\mathcal{T}}, }\) the ‘Gateway-ness’ score of a subset of nodes \({{\mathcal{I}}}\) defined in (4) satisfies the properties P3 and P4, where \({s\neq t, s,t\notin {\mathcal{I}}, {\mathcal{S}}\bigcap {\mathcal{T}}=\emptyset, {\mathcal{S}}\bigcap {\mathcal{I}}={\bf \emptyset}, }\) and \({{\mathcal{T}}\bigcap {\mathcal{I}}=\emptyset}\).

  • P1. \({g(s,t,{\mathcal{I}}) = {\bf Q}(t, {\mathcal{I}}){\bf Q}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf Q}({\mathcal{I}},s);}\)

  • P2. \({g(s,t,{\mathcal{I}})}\) is sub-modular wrt the set \({{\mathcal{I}}.}\)

  • P3. \({g({\mathcal{S,T,I}}) = \sum\nolimits_{s\in {\mathcal{S}},t\in {\mathcal{T}}}{\bf Q}(t, {\mathcal{I}}){\bf Q}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf Q}({\mathcal{I}},s); }\)

  • P4. \({g({\mathcal{S,T,I}})}\) is sub-modular wrt the set \({{\mathcal{I}}.}\)

Proof

See the "Appendix". \(\square\)
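For a single gateway node i, P1 reduces to r(s, i) r(i, t) / r(i, i), since column s of Q holds the proximities from s. The sketch below compares this closed form, computed only on the original graph, with the definition-based score that requires a second random walk on the perturbed graph. The toy graph (which contains cycles) and all names are assumptions for illustration; agreement on one example is of course not a proof.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


# Hypothetical 5-node graph with cycles back to node 0
graph = {0: {1: 1.0, 2: 2.0}, 1: {3: 1.0}, 2: {3: 1.0, 0: 1.0},
         3: {4: 1.0, 0: 1.0}, 4: {}}
s, t, i, c = 0, 4, 1, 0.95

# Definition (1): recompute the proximity after turning node i into a sink.
sunk = {u: ({} if u == i else nbrs) for u, nbrs in graph.items()}
by_definition = rwr(graph, s, c)[t] - rwr(sunk, s, c)[t]

# P1 with a single gateway: only proximities on the original graph are used.
from_s, from_i = rwr(graph, s, c), rwr(graph, i, c)
by_p1 = from_s[i] * from_i[t] / from_i[i]
```

The two values should agree up to floating-point error, which is what allows pre-computed proximities on the original graph to be reused.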

Intuition Here, we provide some intuition why \({\hbox{g}(s,t,{\mathcal{I}})}\) and \({\hbox{g}({\mathcal{S,T,I}})}\) are sub-modular. According to Lemma 1, for a given source s and a given target \({t, \hbox{g}(s,t,{\mathcal{I}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{I}})}\) accounts for the scores of all the paths from s to t, which go through some nodes in \({{\mathcal{K}}}\) but none of the nodes in \({{\mathcal{I}}. }\) Therefore, for a given set \({{\mathcal{K}}, }\) if we already have a bigger subset \({{\mathcal{J}}, }\) the additional benefit \({(\hbox{g}(s,t,{\mathcal{J}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{J}}))}\) will be relatively small, compared to the case where we have a smaller subset \({{\mathcal{I}} \, (\hbox{g}(s,t,{\mathcal{I}}\cup {\mathcal{K}})-\hbox{g}(s,t,{\mathcal{I}}))}\). For example, in Fig. 1, let s = 1, t = 20, and \({{\mathcal{I}}=\{5\}, {\mathcal{J}}=\{2,5\}. }\) Then, if we have a new subset \({{\mathcal{K}}=\{8\}, }\) the additional benefit for subset \({{\mathcal{I}}}\) accounts for all the paths from s = 1 to t = 20 which go through node 8, but not node 5 (e.g., the green path, etc). The additional benefit for subset \({{\mathcal{J}}}\), in contrast, is 0, since all the paths from s = 1 to t = 20 which go through node 8 must also go through some node in \({{\mathcal{J}}}\) (node 2).

4.2 BASSET-N for problem 1

4.2.1 BASSET-N algorithm

Our fast solution for Problem 1 is summarized in Algorithm 1. In Algorithm 1, after initialization (step 1), we first pick a node \(i_0\) with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}\) (step 3). Then, in steps 4–14, we find the rest of the nodes in a greedy way. That is, in each outer loop, we try to find one more node while keeping the current \({{\mathcal{I}}}\) unchanged. According to P1 of Theorem 1, v(i) computed in step 7 is the gateway score for the subset \({{\mathcal{J}}}\) (footnote 3). If the current subset of nodes \({{\mathcal{I}}}\) can completely disconnect the source and the target (by setting them as sinks), we will stop the algorithm (step 12). Therefore, Algorithm 1 always returns no more than k nodes. It is worth pointing out that in Algorithm 1, all the proximity scores are computed from the original graph A. Therefore, we can utilize some powerful methods based on pre-computation to accelerate the whole process. To name a few, for a medium-size graph A (e.g., a few thousand nodes), we can pre-compute and store the matrix \({\bf Q} = (1-c)({\bf I} - c{\bf A})^{-1}\); for large unipartite graphs and bipartite graphs, we can use the NB_LIN and BB_LIN algorithms, respectively (Tong et al. 2008).

Algorithm 1 BASSET-N
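The listing of Algorithm 1 is not reproduced here, so the following is only a rough sketch of a greedy loop in the spirit of the description above; it does not follow the original step numbering. It assumes that the full matrix Q fits in memory (here obtained by one random walk per node, which is feasible only for small graphs), re-evaluates P1 from scratch for each candidate, and stops early once the chosen nodes account for all of r(s, t). The names `rwr`, `solve`, `greedy_gateways`, the tolerance, and the toy graph are hypothetical.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(n):
            if row != col:
                f = aug[row][col] / aug[col][col]
                aug[row] = [x - f * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def greedy_gateways(out_edges, s, t, k, c=0.95, tol=1e-12):
    """Greedily pick at most k gateway nodes, scoring candidates via P1."""
    Q = {j: rwr(out_edges, j, c) for j in out_edges}   # Q[j][i] is Q(i, j)

    def g(I):
        # Q(t, I) * Q(I, I)^{-1} * Q(I, s), using the original graph only
        x = solve([[Q[b][a] for b in I] for a in I], [Q[s][a] for a in I])
        return sum(Q[a][t] * xa for a, xa in zip(I, x))

    candidates = [v for v in out_edges if v not in (s, t)]
    chosen, best = [], 0.0
    while len(chosen) < min(k, len(candidates)) and Q[s][t] - best > tol:
        best, node = max((g(chosen + [v]), v)
                         for v in candidates if v not in chosen)
        chosen.append(node)
    return chosen, best


# Hypothetical graph: a heavy route 0 -> 1 -> 2 -> 4 and a light route 0 -> 3 -> 4
graph = {0: {1: 2.0, 3: 1.0}, 1: {2: 1.0}, 2: {4: 1.0}, 3: {4: 1.0}, 4: {}}
chosen, score = greedy_gateways(graph, s=0, t=4, k=3)
```

On this toy graph the loop selects one node from the heavy route and then node 3, at which point the source and target are disconnected and fewer than k nodes are returned.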

4.2.2 Analysis of BASSET-N

In this subsection, we analyze the effectiveness and the efficiency of Algorithm 1. First, the effectiveness of the proposed BASSET-N is guaranteed by the following lemma. According to Lemma 2, although BASSET-N is a greedy algorithm, the results it outputs are near-optimal.

Lemma 2 Effectiveness of BASSET-N

Let \({{\mathcal{I}}}\) be the subset of nodes selected by Algorithm 1 and \({|{\mathcal{I}}|=k_0. }\) Then, \({g(s,t,{\mathcal{I}})\ge (1-1/e) \max_{|{\mathcal{J}}|=k_0} g(s,t,{\mathcal{J}}), }\) where e is the base of the natural logarithm, and \({g(s,t,{\mathcal{I}})}\) and \({g(s,t,{\mathcal{J}})}\) are defined by (1).

Proof

It is easy to verify that the node \(i_0\) selected in step 10 of Algorithm 1 satisfies \({i_0 = \hbox{argmax}_{j\notin{\mathcal{I}},j\neq s,j\neq t}\hbox{g}(s,t,{\mathcal{I}}\cup \{j\}). }\) Also, we have \(\hbox{g}(s,t,\emptyset) = 0\), where \(\emptyset\) is the empty set. On the other hand, according to Theorem 1, \({\hbox{g}(s,t,{\mathcal{I}})}\) is sub-modular wrt the subset \({{\mathcal{I}}. }\) Therefore (Nemhauser et al. 1978), we have \({\hbox{g}(s,t,{\mathcal{I}})\ge (1-1/e)\hbox{max}_{|{\mathcal{J}}|=k_0}\,\hbox{g}(s,t,{\mathcal{J}}),}\) which completes the proof. \(\square\)

Next, we analyze the efficiency of BASSET-N, which is given in Lemma 3 (footnote 4). We can draw the following two conclusions, according to Lemma 3: (1) the proposed BASSET-N achieves a significant speedup over the straightforward method (\(O(n\cdot k^4)\) vs. \(O({n \choose k }m)\)). For example, in a graph with 100 nodes and 1,000 edges, in order to find the gateway with k = 5 nodes, BASSET-N is more than 6 orders of magnitude faster, and the speedup quickly increases wrt the size of the graph; (2) the proposed BASSET-N is applicable to large graphs since it is linear wrt the number of nodes.

Lemma 3 Efficiency of BASSET-N

The computational complexity of Algorithm 1 is upper bounded by \(O(n\cdot k^4).\)

Proof

The cost for steps 1–2 is constant. The cost for step 3 is O(n). At each inner loop (steps 6–7), the cost is \(O(nj^3 + nj^2)\). The cost for steps 9–13 is O(n). The outer loop has no more than k − 1 iterations. Putting these together, the computational cost for BASSET-N is:

$$ \begin{aligned} \hbox{Cost(BASSET-N)} &\le n + \sum_{j=1}^k (nj^3+nj^2+n) \\ & = n+nk+n\frac{k(k+1)(2k+1)}{6}+n\frac{k^2(k+1)^2}{4} = O(nk^4) \end{aligned} $$
(5)

which completes the proof. \(\square\)
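The closed form in (5) and the speedup quoted for the 100-node example can be checked with simple arithmetic. The sketch below counts abstract operations only, ignoring the constants hidden by the O-notation; the function names are illustrative.

```python
import math


def cost_bound(n, k):
    """Right-hand side of (5) as a sum: n + sum_{j=1}^{k} (n j^3 + n j^2 + n)."""
    return n + sum(n * j**3 + n * j**2 + n for j in range(1, k + 1))


def closed_form(n, k):
    """The same bound using the standard sums of squares and cubes."""
    squares = k * (k + 1) * (2 * k + 1) // 6
    cubes = (k * (k + 1) // 2) ** 2
    return n + n * k + n * squares + n * cubes


n, m, k = 100, 1000, 5
brute_force = math.comb(n, k) * m        # operation count for O(C(n, k) * m)
speedup = brute_force / (n * k**4)       # relative to the O(n k^4) bound
```

For n = 100, m = 1,000 and k = 5, both forms of the bound give 28,600 and the ratio is about 1.2 × 10^6, consistent with the claim of more than 6 orders of magnitude.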

4.3 BASSET-G for problem 2

4.3.1 BASSET-G algorithm

Our fast solution for Problem 2 is summarized in Algorithm 2. It works in a similar way to Algorithm 1: after initialization (step 1), we first pick a node \(i_0\) with the highest \({\sum\nolimits_{s\in{\mathcal{S}},\,t\in{\mathcal{T}}}\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}}\) (step 3). Then, in steps 4–14, we find the rest of the nodes in a greedy way. That is, in each outer loop, we try to find one more node while keeping the current \({{\mathcal{I}}}\) unchanged. If the current subset of the nodes \({{\mathcal{I}}}\) can completely disconnect the source group and the target group (by setting them as sinks), we will stop the algorithm (step 10). As in Algorithm 1, all the proximity scores are computed from the original graph A. Therefore, we can again utilize those powerful pre-computation based methods to accelerate the whole process.

Algorithm 2 BASSET-G

4.3.2 Analysis of BASSET-G

The effectiveness and efficiency of the proposed BASSET-G are given in Lemma 4 and Lemma 5, respectively. Similar to BASSET-N, the proposed BASSET-G is (1) near-optimal; and (2) fast and scalable for large graphs.

Lemma 4 Effectiveness of BASSET-G

Let \({{\mathcal{I}}}\) be the subset of nodes selected by Algorithm 2 and \({|{\mathcal{I}}|=k_0. }\) Then, \({g({\mathcal{S,T,I}})\ge (1-1/e)\max_{|{\mathcal{J}}|=k_0}\,g({\mathcal{S,T,J}}), }\) where \({g({\mathcal{S,T,I}})}\) and \({g({\mathcal{S,T,J}})}\) are defined by (4).

Proof

Similar to the proof of Lemma 2. Omitted for brevity. \(\square\)

Lemma 5 Efficiency of BASSET-G

The computational complexity of Algorithm 2 is upper bounded by \({O(n\cdot(\max(k,|{\mathcal{S}}|,|{\mathcal{T}}|))^4).}\)

Proof

Similar to the proof of Lemma 3. Omitted for brevity. \(\square\)

5 Experimental evaluations

In this section we present experimental results. All the experiments are designed to answer the following questions:

  1. Effectiveness: how effective are the proposed ‘Gateway-ness’ scores in real graphs?

  2. Efficiency: how fast and scalable are the proposed BASSET-N and BASSET-G?

5.1 Experimental setup

Data sets We used six real data sets, which are summarized in Table 2.

Table 2 Summary of the data sets

The first data set (Karate) is an un-weighted unipartite graph, which describes friendship among the 34 members of a karate club at a US university (Zachary 1977). Each node is a member in the karate club and the existence of the edge indicates that the two corresponding members are friends. Overall, we have n = 34 nodes and m = 156 edges.

The second data set (PolBooks) is a co-purchasing book network (footnote 5). Each node is a political book and there is an edge between two books if they were purchased by the same person. Overall, we have n = 105 nodes and m = 882 edges.

The third data set (MovieLens) is from the GroupLens project (footnote 6). It contains 100,000 ratings from 943 users on 1,682 movies. Each user has rated at least 20 movies from 1 (strongly unsatisfactory) to 5 (strongly satisfactory). We use this data set to construct a user-movie bipartite graph. We connect a user with a particular movie if s/he has given a positive rating (4 or 5) to this movie. As a result, users who give only negative ratings and movies which receive only negative ratings are neglected. On the whole, there are 2,389 nodes (942 users, 1,447 movies) and 55,375 edges.

The fourth data set (AC) and the fifth data set (AA) are both from DBLP (footnote 7). The fourth data set (AC) is an un-weighted bipartite graph. We have two types of nodes: author and conference. The existence of the edge indicates that the corresponding author has published in the corresponding conference. Overall, we have 421,807 nodes and m = 2,667,199 edges.

The fifth data set (AA) is a co-authorship network, where each node is an author and the edge weight is the number of the co-authored papers between the two corresponding persons. Overall, we have n = 418,236 nodes and m = 2,753,798 edges.

The last data set (NetFlix) is from the Netflix prize (footnote 8). Rows represent users and columns represent movies. If a user has given a particular movie positive ratings (4 or 5), we connect them with an edge. In total, we have 2,667,199 nodes (2,649,429 users and 17,770 movies), and 56,919,190 edges.

Parameter settings and machine configurations There is one parameter in BASSET-N and BASSET-G, the probability c for random walk with restart. We set c = 0.95, as suggested in (Tong et al. 2008). For the computational cost, we report the wall-clock time. All the experiments ran on the same machine with four 2.4GHz AMD CPUs and 48GB memory, running Linux (2.6 kernel). For each experiment, we run it 10 times and report the average.

5.2 Effectiveness

Here, we evaluate the effectiveness of the proposed ‘Gateway-ness’ scores. We first compare with several candidate methods in terms of separating the source from the target. And then, we present various case studies.

5.2.1 Quantitative comparisons

The basic idea of the proposed ‘Gateway-ness’ scores is to find a subset of nodes which collectively play an important role in measuring the proximity from the source node (or source group) to the target node (or target group). Here, we want to validate this basic assumption. We compare it with the following alternative choices: (a) selecting k nodes with the highest center-piece AND score (CePS-AND) (Tong and Faloutsos 2006); (b) selecting k nodes with the highest center-piece OR score (CePS-OR) (Tong and Faloutsos 2006); (c) randomly selecting k nodes (Rand); (d) randomly selecting k nodes from the neighboring nodes of the source node and the target node (Neighbor-Rand); (e) selecting k nodes with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}\) (Topk-Ind). We randomly select a source node s and a target node t (footnote 9) and then use the different methods to select a subset \({{\mathcal{I}}}\) with k nodes. Figure 2 presents the comparison results, where the x-axis is the number of nodes selected (k), and the y-axis is the normalized decay in terms of the proximity score from the source node s to the target node \({t\,(\frac{{\bf r}(s,t)-{\bf r}_{{\mathcal{I}}}(s,t)}{{\bf r}(s,t)})}\). The resulting curves are averaged over 1,000 randomly chosen source-target pairs. From Fig. 2, we can see that (1) the proposed BASSET-N performs best in terms of separating the source from the target; (2) Topk-Ind, where we simply select the k nodes with the highest \(\frac{{\bf r}(s,i){\bf r}(i,t)}{{\bf r}(i,i)}, \) does not perform as well as BASSET-N, where we want to find a subset of k nodes which collectively has the highest score \({ {\bf r}({\mathcal{I}},t)'{\bf r}({\mathcal{I}},{\mathcal{I}})^{-1} {\bf r}(s,{\mathcal{I}})'.}\)

Fig. 2 Effectiveness comparison between BASSET-N and alternatives. Normalized decay of proximity versus k. Higher is better. The proposed BASSET-N (red star) is the best. (Color figure online)
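A toy example can illustrate why ranking nodes individually (as in Topk-Ind) may lag behind selecting them collectively. In the hypothetical graph below, nodes 1 and 2 lie on the same heavy route, so picking both is redundant, whereas a greedy choice covers both routes. Single-node scores are computed here directly from the definition of the score, which by P1 coincides with the Topk-Ind ranking criterion; the graph and all names are assumptions, and the numbers say nothing about the results in Fig. 2.

```python
def rwr(out_edges, s, c=0.95, iters=2000):
    """Approximate random-walk-with-restart scores from source s."""
    r = dict.fromkeys(out_edges, 0.0)
    r[s] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(out_edges, 0.0)
        nxt[s] = 1.0 - c
        for u, nbrs in out_edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += c * r[u] * w / total
        r = nxt
    return r


def decay(out_edges, s, t, gateways, c=0.95):
    """Normalized decay (r(s,t) - r_I(s,t)) / r(s,t) for gateway set I."""
    sunk = {u: ({} if u in gateways else nbrs) for u, nbrs in out_edges.items()}
    before = rwr(out_edges, s, c)[t]
    return (before - rwr(sunk, s, c)[t]) / before


# Hypothetical graph: heavy route 0 -> 1 -> 2 -> 4, light route 0 -> 3 -> 4
graph = {0: {1: 2.0, 3: 1.0}, 1: {2: 1.0}, 2: {4: 1.0}, 3: {4: 1.0}, 4: {}}
s, t, k = 0, 4, 2
candidates = [1, 2, 3]

# Individual ranking: take the k nodes with the best single-node scores.
topk = sorted(candidates, key=lambda v: decay(graph, s, t, {v}), reverse=True)[:k]

# Greedy selection: repeatedly add the node that helps the current set most.
greedy = []
for _ in range(k):
    rest = [v for v in candidates if v not in greedy]
    greedy.append(max(rest, key=lambda v: decay(graph, s, t, set(greedy) | {v})))

decay_topk = decay(graph, s, t, set(topk))
decay_greedy = decay(graph, s, t, set(greedy))
```

In this example the individual ranking picks the two nodes of the heavy route and leaves the light route open, while the greedy choice blocks both routes and therefore achieves a larger normalized decay.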

5.2.2 Case studies

Next, we will show some case studies, to demonstrate the effectiveness of BASSET-N and BASSET-G.

Karate We start with the Karate graph, which is widely used in social network analysis. In Fig. 3, there are two different communities in the graph (shaded). In each community, there are some ‘hub’ nodes (e.g., nodes 33 and 34 in the left community; and nodes 1 and 4 in the right community). The two communities are connected by some ‘bridging nodes’ (e.g., nodes 3, 10, 19, 20). Table 3 presents the resulting gateways of BASSET-N with the budget k = 5 for a few source-target pairs. The results are consistent with human intuition. The gateways either are the local center of the community that the source/target node belongs to, or are bridging nodes that connect the two communities when the source node and the target node belong to different communities. For example, if s = 1 and t = 33, the resulting nodes 3, 10, 11 are bridging nodes, while node 34 is the local center for the left community. Note that we always return at most k = 5 nodes. For example, if s = 15 and t = 34, we only output one node (node 1) as the gateway. This is because all the paths from node 15 to node 34 must go through node 1.

Fig. 3 Karate graph

Table 3 BASSET-N on Karate graph

PolBooks For this data set, the nodes are political books and the existence of the edge indicates the co-purchasing (by the same person) of the two books. Each book is annotated by one of the following three labels: ‘liberal’, ‘conservative’ and ‘neutral’. We pick a ‘liberal’ book (‘The Price of Loyalty’) as the source node, and a ‘conservative’ book (‘Losing Bin Laden’) as the target node. Then, we ran the proposed BASSET-N to find the gateway with 10 nodes. The result is presented in Table 4. The result is again consistent with human intuition: the resulting gateway books are either popular books in one of the two communities (‘conservative’ vs. ‘liberal’), such as ‘Bush country’ from ‘conservative’ and ‘Back up suck up’ from ‘liberal’; or ‘neutral’ books which are likely to be purchased by readers from both communities (e.g., ‘Sleeping with the devil’).

Table 4 BASSET-N on PolBooks graph

MovieLens For this data set, we pick a user who likes comedy but not horror movies (i.e., she has given positive ratings to some comedy movies, but has given negative ratings to some horror movies, or has never rated any horror movies) as the source node. We also pick a user who likes horror but not comedy movies as the target node. Then, we ran BASSET-N to find the gateway with k = 10 nodes. Table 5 lists the resulting gateways for two such source-target pairs. We can see that the resulting movies are either popular comedies (e.g., “Shakespeare in Love”) or horror/thriller movies (e.g., “Silence of the Lambs”). Interestingly, “Ghostbusters”, whose genre is both comedy and horror (and which therefore might be liked by both the source user and the target user), shows up in Table 5a.

Table 5 BASSET-N on MovieLens network
Table 6 BASSET-N on AC graph

AC This is a bipartite graph. Given a source conference/author and a target conference/author, we can run BASSET-N to find either the gateway conferences or the gateway authors (Table 6). Table 7 gives one such example when the source is ‘VLDB’ and the target is ‘NIPS’. Conceptually, we treat an \(n_1 \times n_2\) bipartite graph as an \((n_1 + n_2) \times (n_1 + n_2)\) unipartite graph, and we further restrict the search to the desired node type. Again, we can see that the results make sense. The resulting gateway authors are either productive in one of the two fields, databases vs. statistics (e.g., Prof. Michael I. Jordan in statistics, Prof. Hector Garcia-Molina in databases, etc); or productive in data mining (e.g., Dr. Rakesh Agrawal, Prof. Jiawei Han), which is an intersection field between statistics and databases. We have similar observations for the resulting gateway conferences. For example, ‘SIGMOD’ and ‘UAI’ are isomorphic (i.e., have very similar neighbor sets) to ‘VLDB’ and ‘NIPS’, respectively; and ‘KDD’ is one major conference in data mining, which is a highly plausible major connection from ‘VLDB’ (databases) to ‘NIPS’ (statistics / machine learning).

Table 7 BASSET-N on AC network

AA We use this data set to perform case studies for the proposed BASSET-G. We choose (1) a group of people from a certain field (e.g., ‘text’, ‘theory’, etc) as the source group \({{\mathcal{S}}; }\) and (2) another group of people in some other field (e.g., ‘databases’, ‘bioinformatics’, etc) as the target group \({{\mathcal{T}}. }\) Then, we ran the proposed BASSET-G to find the gateway with k = 10 nodes. Table 8 lists some results. They are all consistent with human intuition: the resulting authors are either productive authors in one of the two fields, or multi-disciplinary authors who have close collaborations with both the source and the target groups of authors.

Table 8 BASSET-G on AA network

5.3 Efficiency

We will study the wall-clock running time of the proposed BASSET-N and BASSET-G here. Basically, we want to answer the following two questions:

  1. (Speed) What is the speedup of the proposed BASSET-N and BASSET-G over the straightforward methods?

  2. (Scalability) How do BASSET-N and BASSET-G scale with the size of the graph (n and m)?

First, we compare BASSET-N and BASSET-G with two straightforward methods: (1) ‘Com-RWR’, where we use combinatorial enumeration to find the gateway and, for each enumerated subset, we compute the proximity from the new graph; and (2) ‘Com-Eval’, where we use combinatorial enumeration to find the gateway and, for each enumerated subset, we compute the proximity from the original graph. Figure 4 shows the comparison on two real data sets. We can draw the following conclusions. (1) The straightforward methods (‘Com-RWR’ and ‘Com-Eval’) are computationally intractable even for a small graph. For example, on the Karate data set with only 34 nodes, it takes more than 20,560 seconds and more than 100,000 seconds to find the k = 10 gateway by ‘Com-Eval’ and by ‘Com-RWR’, respectively. (2) The speedup of the proposed BASSET-N and BASSET-G over both ‘Com-Eval’ and ‘Com-RWR’ is significant: in most cases, we achieve speedups of several (up to 6) orders of magnitude. (3) The speedup of the proposed BASSET-N and BASSET-G over both ‘Com-RWR’ and ‘Com-Eval’ increases quickly wrt the size of the gateway k. Note that we stop running the program if it takes more than 100,000 seconds (i.e., longer than a day).
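To see why combinatorial enumeration becomes intractable so quickly, it helps to count the candidate subsets. The sketch below is only a back-of-the-envelope illustration: the helper `num_candidate_gateways` is hypothetical, and it assumes a single source and a single target that are both excluded from the candidate pool, so that a size-k gateway is chosen among the remaining n − 2 nodes.

```python
from math import comb


def num_candidate_gateways(n, k, n_excluded=2):
    """Number of size-k subsets an exhaustive search would have to evaluate.

    Assumption: n_excluded nodes (here the source and the target) cannot
    be part of the gateway.
    """
    return comb(n - n_excluded, k)


# Growth of the search space on a 34-node graph (the size of Karate).
for k in (1, 2, 5, 10):
    print(k, num_candidate_gateways(34, k))
```

Each of these subsets requires at least one proximity evaluation, which is consistent with the observation that the gap to the proposed methods widens as k grows.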

Fig. 4
Comparison of speed. Wall-clock time versus k. Lower is better. Time is on a logarithmic scale. The proposed BASSET-N and BASSET-G (red star) are significantly faster. (Color figure online)

Next, we evaluate the scalability of the proposed BASSET-N and BASSET-G wrt the size of the graph, using the largest data set (NetFlix). From Fig. 5, we can draw the following conclusions: (1) if we fix the number of nodes (n) in the graph, the wall-clock time of both BASSET-N and BASSET-G is almost constant wrt the number of edges (m); and (2) if we fix the number of edges (m) in the graph, the wall-clock time of both BASSET-N and BASSET-G is linear wrt the number of nodes (n). Therefore, they are suitable for large graphs.

Fig. 5
Scalability of BASSET. Wall-clock time versus the size of the graph. Lower is better. \({|{\mathcal{S}}|=|{\mathcal{T}}|=5}\)

6 Related work

In this section, we review the related work, which can be categorized into four parts:

Betweenness centrality The proposed ‘Gateway-ness’ scores relate to measures of betweenness centrality, both those based on the shortest path (Freeman 1977) and those based on random walk (Newman 2005). When the gateway set size is k = 1, the proposed ‘Gateway-ness’ scores can be viewed as query-specific betweenness centrality measures. Moreover, in the proposed BASSET-N and BASSET-G, we aim to find a subset of nodes collectively, whereas in traditional betweenness centrality, we usually calculate the score for each node independently (and then might pick the k nodes with the highest individual scores).

Connection subgraphs In the proposed BASSET-N, the idea of finding a subset of nodes wrt the source/target is also related to the concept of connection subgraphs (Faloutsos et al. 2004; Koren et al. 2006; Tong and Faloutsos 2006). However, in connection subgraphs, we aim to find a subset of nodes that have strong connections among themselves for the purpose of visualization. In contrast, in the proposed BASSET-N, we implicitly encourage the resulting nodes to be disconnected from each other, so that they can collectively disconnect the target node from the source node to the largest extent (if we set them as sinks). It is interesting to note that finding the gateway with k = 1 for BASSET-N can be viewed as a normalized, directed version of the CePS-AND score (Tong and Faloutsos 2006) (see Footnote 10). Moreover, we allow the more general case where the source/target is a group of nodes in the proposed BASSET-G, whereas in connection subgraphs the source/target is always a single node.

Graph proximity The basic idea of the proposed BASSET-N and BASSET-G is to find a subset of nodes which will bring the largest decrease of the proximity score from the source node (or the source group) to the target node (or the target group). Graph proximity itself is an important building block in many graph mining settings. Representative work includes the BANKS system (Aditya et al. 2002), link prediction (Liben-Nowell and Kleinberg 2003), content-based image retrieval (He et al. 2004), cross-modal correlation discovery (Pan et al. 2004), pattern matching (Tong et al. 2007), ObjectRank (Balmin et al. 2004), RelationalRank (Geerts et al. 2004), etc.
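The idea of selecting nodes by the decrease in proximity they cause can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than the BASSET algorithms themselves: proximity is computed by a plain fixed-point iteration of random walk with restart with restart parameter `c`, a removed node is modelled as a sink by ignoring its outgoing edges, and the two search routines (`brute_force` and a generic `greedy` loop) recompute the proximity from scratch for every candidate, which is exactly the cost that a fast method would need to avoid.

```python
from itertools import combinations


def column_normalize(A):
    """Column-normalize adjacency A so that W[i][j] = P(step from j to i)."""
    n = len(A)
    sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[A[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def proximity(W, s, t, sinks=(), c=0.9, iters=300):
    """RWR proximity from s to t via the iteration r = c*W*r + (1-c)*e_s.

    Nodes in `sinks` keep their incoming edges but never pass mass on.
    """
    n = len(W)
    sinks = set(sinks)
    r = [0.0] * n
    for _ in range(iters):
        new = [0.0] * n
        for j in range(n):
            if j in sinks or r[j] == 0.0:
                continue
            for i in range(n):
                if W[i][j]:
                    new[i] += c * W[i][j] * r[j]
        new[s] += 1.0 - c
        r = new
    return r[t]


def score(W, s, t, nodes):
    """Decrease in proximity from s to t when `nodes` are turned into sinks."""
    return proximity(W, s, t) - proximity(W, s, t, sinks=nodes)


def brute_force(W, s, t, k):
    """Exhaustive search over all size-k subsets of the other nodes."""
    cand = [v for v in range(len(W)) if v not in (s, t)]
    return max(combinations(cand, k), key=lambda I: score(W, s, t, I))


def greedy(W, s, t, k):
    """Add, k times, the node that yields the largest score so far."""
    cand = [v for v in range(len(W)) if v not in (s, t)]
    chosen = []
    for _ in range(k):
        best = max((v for v in cand if v not in chosen),
                   key=lambda v: score(W, s, t, chosen + [v]))
        chosen.append(best)
    return chosen


# Toy 5-cycle 0-1-2-4-3-0 with source 0 and target 2.
A = [[0] * 5 for _ in range(5)]
for u, v in [(0, 1), (1, 2), (2, 4), (4, 3), (3, 0)]:
    A[u][v] = A[v][u] = 1
W = column_normalize(A)
print(greedy(W, 0, 2, 1), brute_force(W, 0, 2, 1))
```

On this toy cycle, both searches select node 1, the only intermediate node on the shorter route from the source to the target; turning it into a sink leaves only the longer route through nodes 3 and 4.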

Other related work in graph mining In recent years, graph mining has been a very active research topic. Representative work includes pattern and law mining (Albert et al. 1999; Broder et al. 2000), frequent substructure discovery (Jin et al. 2005; Xin et al. 2005), influence propagation (Kempe et al. 2003), fraud and anomaly detection (Neville et al. 2005; Noble and Cook 2003), recommendation (Agarwal and Merugu 2007; Cheng et al. 2007), community mining and graph partition (Backstrom et al. 2006; Chen et al. 2009; Gibson et al. 1998; Girvan and Newman; Karypis and Kumar 1999; Qian et al. 2009), near-clique detection (Pei et al. 2005), etc.

7 Conclusion

In this paper, we study how to find good ‘gateway’ nodes in a graph, given one or more source and target nodes. Our main contributions are: (a) we formulate the problem precisely; and (b) we develop BASSET-N and BASSET-G, two fast (up to 6 orders of magnitude of speedup) and scalable (linear wrt the number of nodes in the graph) algorithms that solve it in a provably near-optimal fashion, using sub-modularity. We apply the proposed BASSET-N and BASSET-G to real data sets to validate their effectiveness and efficiency.