A linear delay algorithm for enumerating all connected induced subgraphs

Background Real biological and social data is increasingly being represented as graphs. Pattern-mining-based graph learning and analysis techniques report meaningful biological subnetworks that elucidate important interactions among entities. At the backbone of these algorithms is the enumeration of pattern space. Results We propose an efficient algorithm for enumerating all connected induced subgraphs of an undirected graph. Building on this enumeration approach, we propose an algorithm for mining all maximal cohesive subgraphs that integrates vertices’ attributes with subgraph enumeration. To efficiently mine all maximal cohesive subgraphs, we propose two pruning techniques that remove futile search nodes in the enumeration tree. Conclusions Experiments on synthetic and real graphs show the effectiveness of the proposed algorithm and the pruning techniques. On enumerating all connected induced subgraphs, our algorithm is several times faster than existing approaches. On dense graphs, the proposed approach is at least an order of magnitude faster than the best existing algorithm. Experiments on protein-protein interaction network with cancer gene dysregulation profile show that the reported cohesive subnetworks are biologically interesting.


Background
Mining interesting subgraphs from a large graph has been extensively studied. The modular structure has been observed in many real-world networks and shown to reveal insights into the intricate interactions that take place in real-world networks. Subgraph mining aims at discovering subgraphs that have interesting structural properties. Graph density, the ratio of present edges to the possible edges, has been the main property of interesting subgraphs. Abello et al. 2002 [1] proposed a greedy randomized algorithm for mining dense subgraphs. Matsuda et al. 1999 [2] introduced an approximation algorithm for mining a subset of the quasi-cliques present in a graph. A reverse-search-based algorithm for enumerating all dense *Correspondence: saeed.salem@ndsu.edu 1 North Dakota State University, Fargo, ND 58102, USA Full list of author information is available at the end of the article subgraphs from an unweighted graph has been proposed in [3,4].
Integrating node and edge attribute data with graph analysis has received attention since mining data from multiple sources has been shown to improve graph learning. In protein-protein interaction analysis, highly interacting proteins are more likely to form function modules. Functional module discovery can be aided by the integration of gene expression from multiple experiments as the genes in functional modules tend to have similar expression patterns [5,6]. Moreover, subnetworks with differentially expressed genes have been shown to be good subnetwork biomarkers [6,7]. Moser et al. [8] proposed the CoPaM algorithm for integrating the vertices' attributes with dense subgraph mining. A reverse-search algorithm was used for mining dense cohesive subgraphs from a weighted protein-protein interaction network with nodes' attributes have been proposed in [9]. Mining maximal homogeneous clique sets has been introduced in [10]. In Silva et al. [11], structural correlation mining was proposed for mining quasi-cliques that have correlated attributes.
In sparse attributed graphs, meaningful subgraphs can have very low density, yet exhibit high attribute similarity, e.g., biological pathways. Thus, it is important to mine connected subgraphs with high attribute similarity without the density constraint.
To achieve this goal, an algorithm for enumerating all connected induced subgraphs is needed as the backbone of the mining process. Additional attribute similarity constraints can be enforced while exploring the connected subgraphs search space. Moreover, the problem of enumerating all subgraphs is important in the field of computer-aided structure elucidation in cheminformatics for enumerating possible chemical graphs and stereoisomers [12]. The problem of enumerating all connected subgraphs might seem intractable since the number of these subgraphs can be exponential. However, in sparse graphs, the number of connected vertex sets is much smaller than the size of the power set of the set of vertices.

Related work
The naive brute force algorithm to solve this problem is to generate the power set of the vertices, and then remove the elements in the power set that does not represent a connected subgraph. Clearly this algorithm is inefficient since it generates the power set of vertices, most of which are not connected. Another brute fore algorithm to solve the problem is to generate only connected subgraphs. The algorithm starts with one vertex as a subgraph, and then adds a neighbouring vertex to it every time. The process ends if the extended sub-graph is already visited or there is no more vertices to add. The drawback of this approach is storing all visited subgraphs which can be exponential and searching for the presence of a subgraph each time we extend the subgraph. Maxwell et al. [13] introduced the BDDE algorithm for enumerating all connected induced subgraphs. The BDDE algorithm follows a breadth-first discovery, and depth-first extension to enumerate the subgraphs. The algorithms starts with one vertex every time, and enumerate the binomial tree of neighbours of that vertex. This will enumerate all subgraphs that consist of the chosen vertex and its neighbors. The next step will be to enumerate subgraphs beyond the direct neighbors of the chosen vertex, by following each path in the binomial tree and treating it as a local search node, and building a sub-binomial tree for the direct neighbors of all vertices in the path except the vertices that are already visited. All neighbors of a local search tree are marked as visited before recursively call the depth-first search function, that eliminates duplicates that might be generated if the same neighbors are reached again by continued depth search. For a complete graph, the BDDE algorithm could consume a total space of O 2 N−1 , where N is the number of vertices in the input graph. Constraints defined over the nodes' attributes can be integrating into the BDDE algorithm. Recently, the TGE algorithm for enumerating all induced connected subgraphs has been proposed [14]. This algorithm uses recursion to solve the problem. For a given vertex v, the connected vertex induced subgraphs are partitioned into two groups: subgraphs that include v, and subgraphs that do not include v. The connected vertex induced subgraphs included in the latter group can be enumerated by recursively solving the problem after deleting v. The former one can be solved again by partitioning the connected vertex induced subgraphs into two groups based on another chosen vertex. The author showed that the time complexity is O(1) for each solution by amortization.
Reverse Search is a powerful paradigm for enumeration. It was first introduced by Avis and Fukuda [15], and employed to solve several enumeration problems, including all induced connected subgraphs, spanning trees of a graph, maximal independent sets of a graph, and mining frequent bipartite episode from event sequences. The basic idea of Reverse Search is to arrange all subsets to be enumerated in a tree, where each node in the tree appears only once. The backbone of a reverse search algorithm is the definition of a parent operation that reduces a node to a unique parent node. By repeatedly applying the parent operation on any two different nodes in the search tree, they will be reduced to a shared canonical node, the root of the traversal tree. Once the child operation is defined by inverting the parent operation, we construct the enumeration tree by simply applying depth-first traversal, starting from the root.
A reverse search algorithm, RS-MST, for enumerating all induced connected subgraphs has been introduced in [15]. The parent operation employed for enumerating all induced connected subgraphs was based on the minimum spanning tree of the subgraph. For an induced connected subgraph, G, removing a vertex v that has a degree one in the minimum spanning tree of G cannot disconnect the subgraph. The authors in [15] proposed the child operation that reverses the vertex removal.
In this paper, we propose a novel reverse search algorithm for enumerating all induced connected subgraphs of a graph. Building on this enumeration approach, we propose an algorithm for mining all maximal cohesive subgraphs that integrates vertices' attributes with subgraph enumeration. To efficiently mine all maximal cohesive subgraphs, we propose two pruning techniques that eliminate futile search subtrees in the enumeration tree, resulting in significant improvement in the running time of the algorithm. To demonstrate the effectiveness of the proposed algorithms and the pruning techniques, we conducted experiments on synthetic and real-world graphs.

Methods
Let G = (V , E) be an undirected graph, where V = {v 1 , ..., v n } is the set of vertices, and E ⊆ V × V is the set of edges. For any vertex set U ⊆ V , let G(U) = (U, E(U)) denote the subgraph of G induced by U, whose edges include all the edges of G with endpoints in U. We call U a connected vertex set if G(U) is connected.
Problem Definition: Given an undirected graph G(V , E), enumerate all connected vertex sets, CIS(G).

CIS(G) = {U | U ⊆ V and G(U) is connected}
In this paper, we propose a linear-delay linear-space algorithm for enumerating all connected vertex sets of an undirected graph.

Reverse search
In reverse search, a pattern extension rule defines how to generate child search nodes from a parent search node in the search space. The basic idea of reverse search is to arrange all solutions to be enumerated in a tree, rooted at an empty set node (canonical object), where each node in the tree appears only once under a specific parent node. In reverse search, a parent operation determines the unique parent node of a search node. This operation can be repeatedly applied on any two different nodes in the search tree until they reach a shared canonical node, the root of the traversal tree. Once the parent operation is defined, a child operation can be derived. Building on the parent-child operation, we build a tree-shaped traversal route on the set connected vertex sets. We perform the depth-first search on the tree without having the tree in memory to enumerate all induced connected subgraphs.
In this section, we define the parent operation and a data structure that allows for efficient parent/child operations.

Parent child relationship
The following lemma is essential: is a connected graph, s, u ∈ U are two distinct vertices, and u is the vertex with the largest shortest path from s, then G(U − u) is connected.
Proof Assume that u is the furthest vertex away from s and deleting u results in a disconnected graph. This means that there exists at least one vertex u such that all paths between s and u go through u. So, the shortest distance between s and u is greater than the shortest distance between s and u. This contradicts our assumption that u is the vertex with the longest shortest path from s in G.
Clearly, we can choose any vertex in U, then find the furthest vertex away from it and delete it, and still get a connected subgraph with size |U| − 1. It does not matter which vertex to choose, and also does not matter if the chosen vertex has many vertices with the same furthest distance because deleting any of them will produce a connected subgraph. In this work, for defining a child/parent operation, we need to designate a vertex of the subgraph as the anchor vertex. We denote the vertex with the smallest vertex identifier (smallest vertex lexicographically) in U as anchor(U). Let v ∈ U be the vertex with the longest shortest path to s = anchor(U). If there are more than one vertex with the longest shortest path, we take the one with the largest vertex identifier. We refer to the vertex with the longest shortest path to s in a graph (G(U)) as the utmost vertex.
We define the parent graph for a subgraph as fol- is the parent subgraph of G(U) (Lemma 1). The parent operation simply deletes the utmost vertex of a subgraph. It also can be repeatedly applied on a subgraph until reaching the canonical object (empty set). Figure 1 shows how to repeatedly apply the parent operation on a graph until reaching the empty set. Now we derive the the child operation from the parent operation, as follow: Let U be a connected vertex set, s = anchor(U), u ∈ U is the utmost vertex of U, and v ∈ V \ U is connected to U. Then the subgraph induced by U * = U ∪ {v} is a child of G(U) if and only if v > s (lexicographically) and one of the following conditions holds: 1 The distance from s to v is greater than the distance from s to u, or 2 Both v and u have the same distance to s, but v is lexicographically greater than u.
if G(U * ) is a valid child of G(U), we call v a valid candidate of G(U), otherwise, we call it an invalid candidate of G(U). Figure 2a shows a sample graph, and Fig. 2b shows the enumeration tree of this graph. Every search node in the enumeration tree represents a connected induced subgraph. Figure 2b shows that search node {A, D} is extended with vertex C to produce {A, D, C}; the other possibility {A, D, B} is crossed to indicate that it is not a valid child. In the leftmost branch, vertex D cannot be added to search node {A, B, C} because distance(A, D) = 1 < distance(A, C) = 2. Under the subtree rooted at B, vertex C cannot be added to {B, D} because distance(B, D) = distance(B, C) = 1, but C is lexicographically less than D.
In the middle, search node {B, A} is crossed out because vertex A is less than the anchor vertex B.
Applying the parent operation. Repeatedly applying the parent operation on a graph. a The anchor vertex is A, and the utomst vertex is F. b After deleting vertex F, vertices C and E become the furthest vertices with the same distance away from A, so E is the utmost vertex. We reduce the subgraph by deleting vertex E. c-g We apply the same procedure until deleting the last vertex A

Distance-Array representation
One way to speed up Reverse Search is to design a data structure that speeds up testing for valid children. In this section, we describe a data structure to represent each subgraph to be enumerated, such that checking each valid child takes a constant time. Moreover, building the data structure of a valid child, given the data structure of the parent node, takes only O( ) where is the maximum degree of the input graph. Given a subgraph G = (V , E), we use a data structure of four arrays of size |V |. The U array holds the vertices of the subgraph in the same order they were visited. The C array holds the neighbors (candidates) of the subgraph. The D array holds the distance between the anchor vertex and all other vertices. And the P array keeps track of the parent of each vertex in U or in C; The parent of a vertex v is the vertex connected to it on the path to the anchor when v was first added to C. The anchor vertex does not have a parent vertex. Figure 3 shows a sample graph G of 14 vertices and 22 edges. The dashed vertices and edges represent the subgraph induced by the subset U = {2, 3, 4, 5, 7, 9}. The data structure for U is depicted in Table 1.
The anchor vertex of the induced subgraph G(U) is 2, and the utmost vertex is 9, with distance 3 away from the anchor vertex. The whole graph has 14 vertices labeled from 1 to 14. Only 6 vertices belong to the subset U and there are only five neighbors in C. Using this representation, we can easily determine the anchor vertex, since it is the first one in U, and the utmost vertex, since it is the  last vertex in U. We can also get the distance between any neighbor of the subgraph and the anchor vertex in O(1) by accessing the corresponding index in D.
Using this representation, we can, for instance, extend the subgraph G(U) with the valid neighbor vertex v = 10 to form the subgraph induced by the subset U * = {2, 3, 4, 5, 7, 9, 10}. We need to add neighbors of v = 10 to C array and update their distances in D to be 4, and their parent in P to be 10. This will take only O( ). The data structure representation of U * is shown in Table 2.
When backtracking, the P array is used to determine which candidates to be deleted from C. For instance, when backtracking from U * to U, we first delete last added candidates whose parent is 10 from C (11 and 12) and reset the values of these indices in the D and P arrays, then we delete the 10 vertex from U.
For further improvement, only valid candidates are kept in C to avoid redundant checking for valid candidates. We maintain three extra arrays to hold invalid candidates, the last added vertex to U when the candidate vertex became invalid, and the original index of the candidate vertex at C. We use information in these arrays to move these candidates back to their original indices in C when backtracking. Through extensive experiments, we noticed that the performance gain in the running time achieved

Complexity analysis
An algorithm is said to be a linear-delay algorithm if it takes linear time, in terms of input size, to compute the next solution given a solution, or to detect that there are no more solutions. In our case, we consider the time the algorithm takes to generate the first child subgraph, given the parent subgraph.   for v ∈ C do 8: if ISVALIDEXTENSION(U, v) then x = lastAdded(U) 18: if v < s then 19: return False

Maximal cohesive subgraphs
In many applications, we are only interested in connected subgraphs that meet a user-defined constraint. Let f : 2 V → R denote a scoring function that quantifies vertex sets. Moreover, given a threshold δ, the anti-monotone constraint guarantees that if the score of a vertex set is at least δ, then score of each subset of the vertex set is also at least δ, i.e., f (U) ≥ δ =⇒ ∀U * ⊂ U : f (U * ) ≥ δ In this section, we assume that the vertices in the graph are annotated with features. This leads to the undirected attributed graph G = (V , E, f ) where V is the set of vertices, E is the set of edges, and f : V → {0, 1} d is a function that maps vertices to d-dimensional binary vectors. We are interested in mining subsets of connected vertices that have similar features. A dimension j is a cohesive dimension for a vertex set(subgraph) if the value of the dimension is '1' in all the binary vectors of the vertices of the set; j is cohesive for Given a user-defined threshold S min , a subgraph G(U) is called cohesive, if the number of dimensions in A(U) is at least S min . The cohesive condition is an anti-monotone constraint where all the subgraphs of a cohesive graph are also cohesive. The set of all cohesive subgraphs for an attributed graph will have a large number of overlapping subgraphs since the subgraphs of a cohesive subgraph are also cohesive. To reduce redundancy in the output subgraphs, we require the subgraphs to be maximally cohesive. A subgraph is maximal cohesive subgraph if it does not have a supergraph that is cohesive, i.e., Problem Definition: Given an attributed graph G = (V , E, f ), and threshold S min , the problem of mining the set of maximal cohesive subgraphs is to enumerate the set: This problem can be addressed by employing the reverse search enumeration approach in algorithm 1 to enumerate all cohesive subgraphs and report only leaf search nodes that do not have any valid or invalid cohesive child nodes. For a highly-connected graph and a relaxed cohesive constraint, enumerating the entire search tree of all cohesive subgraphs is computationally expensive. In the following subsections, we describe pruning strategies to reduce the size of the enumeration tree by pruning entire search branches without missing any search nodes. The pruning strategies result in significant performance improvement.

Nodes with a preceding covering sibling
Let x and y be two neighbors of G(U) such that x is closer to anchor(U) than y (x ≺ U y), and G(U ∪ {x}) and G(U ∪ {y}) are cohesive subgraphs with A(U ∪ {y}) ⊆ A(U ∪ {x}), then none of them is a maximal cohesive subgraph, and any maximal subgraph that contains G(U ∪ {x}) will also contain G(U ∪{y}), and vise versa. Moreover, G(U ∪{x, y}) is also a cohesive subgraphs that can be reached from both G(U ∪ {x}) and G(U ∪ {y}), but is a valid child of only one of them. Note that since In this case, we can prune the search branch rooted at one of the two subgraphs.

Lemma 2 Let G(U ∪ {x}) and G(U ∪ {y}) be two cohesive subgraphs, x is closer to anchor(U) than y (x ≺ U y), and A(U ∪ {y}) ⊆ A(U ∪ {x}), then the search branch rooted at G(U ∪ {y}) can be safely pruned. Proof For a set of vertices
is a cohesive subgraph since x is connected to U and can be added to G(U ∪ {y} ∪ Z) without violating the attribute similarity constraint. This contradicts our assumption that G(U ∪ {y} ∪ Z) is a maximal cohesive subgraph. This proves that G(U ∪{y}∪Z) is not a maximal cohesive subgraph. Moreover, is not a descendant of G(U ∪{y}) since x is not valid extension once y is added to vertex set U because x is closer to anchor(U) than y. But all descendants of G(U ∪{y}) can be expressed as G(U ∪ {y} ∪ Z). So none of the descendants of G(U ∪ {y}) will be a maximal cohesive subgraph. Therefore, it is safe to prune the search branch rooted at G(U ∪ {y}) without losing any maximal cohesive subgraphs. In Fig. 4, the search branch rooted at C can be safely pruned because it is connected to B and and A(C) ⊆ A(B).

Nodes with the same features as its parent
This pruning strategy handles a special case where the attributes of a child node are identical to those of the parent node. After sorting neighbors of U, if there is a child U * such that A(U) = A(U * ), then all succeeding neighbors can be pruned safely using the previous lemma, because their descendants will be enumerated under the U * search branch. Although it looks like that this pruning operation is theoretically redundant of the first operation, it saves practically the time needed to check if the siblings are covered by any proceeding one. So once we observe that there is a node with the same feathers as the parent node, there is no need to check whether the succeeding neighbors are covered by this node. We will show in the experiments section that this pruning technique improves the performance.
In Fig. 4b, search node {A, B, F, H} has same features as its parent, hence, all its succeeding siblings can be pruned.

Results
We compare the performance of the proposed approach for enumerating all connected induced subgraphs to that of three existing algorithms on random graphs with varying graph size and density. Moreover, we test the running time on real enzymes. Moreover, to test the performance of the proposed approach for mining maximal cohesive subgraphs, we evaluate the performance on a real protein-protein interaction network with gene dysregulation profile in 13 cancer types as attributes. All experiments were performed on a machine with Intel Xeon 2.40 GHz processor with 16 Gbytes main memory, running the Linux operating system. The two reverse search enumeration approaches were implemented in C++. The TGE algorithm is implemented in C and the BDDE algorithm in Perl as provided by their respective authors.

Performance on random graphs
We generated random graphs with varying numbers of nodes and density. Figure 5a shows the running times on graphs with varying size while keeping the density at 0.6. Figure 5b shows the running times on random graphs with varying density while the number of vertices was set to 27. We can see that RS-SP runs about one order of magnitude faster. We can see that our proposed algorithm is at least an order of magnitude faster than the best competing algorithm (TGE) and two orders of magnitude faster that the BDDE and RS-MST algorithms. For graphs with larger number of nodes (> 28), the BDDE algorithm uses too much memory and crashes after 1 h. For larger graphs (> 31), the RS-MST did not finish the enumeration task in 27 h.

Performance on real data
We tested our algorithm on real chemical graphs downloaded from the network repository [16]. We compared against the TGE algorithm since it is the fastest among the competing algorithms. We ran both algorithms on ten graphs for which the running time is less than nine hours. For larger graphs, it takes days before we could get any results. Table 3 shows the running time of the TGE and RS-SP algorithms; RS-SP is several times faster than the TGE algorithm. Due to the nature of chemical compounds, most atoms (nodes) have a degree of at most 8 (maximal valence of atoms), and thus large chemical graphs are not dense. For these sparse graphs, the speedup is not high.

Cohesive subnetworks
We use the BIOGRID protein-protein interaction network (PPI) (version 3.4.160; May 2018) that has 287,970 interactions among 21,429 genes [17]. For attribute data, we used the gene dysregulation profile in 13 cancers. The dataset was generated from the gene and miRNA expression data of 13 tumor types and matched normal samples [18]. On average each cancer dataset had 2380 dysregulated genes and each gene was dysregulated in 3.4 cancers. We ran the algorithm with all the pruning techniques on the attributed BIOGRID network for varying minimum support. The algorithm was extremely fast finishing in less than one second for S min ≥ 6, and for S min = 2, and 1 it took 21 and 74 seconds, respectively.

Effectiveness of pruning techniques
To show the impact of the pruning techniques on the running time, we turned off the pruning techniques in the algorithm one at a time. Figure 6 shows the impact of the pruning techniques. For 1 ≤ S min ≥ 3, the algorithm without any pruning did not finish in 50 h, resulting in more than 400 speed up for each of the pruning techniques. Table 4 shows the topological properties and biological enrichment analysis for the genesets in the reported  [19,20] and the DisGeNET human gene-disease associations database [21] for assessing the enrichment of the genes in these reported patterns with these signatures. If a biological annotation is overrepresented in the reported subgraph's genes, the subgraph pattern is considered as enriched. The overrepresentation test is modeled as a hybergeometric test (with pvalue = 0.05) and we checked for enrichment in the following collections:

Maximal cohesive subgraphs analysis
1 Hallmark signatures: gene sets that represent biological processes and display coherent expression. 2 KEGG signatures: gene sets derived from the KEGG pathway database. 3 Oncogenic signatures: gene sets of dysregulated cellular pathways in caner. 4 DisGeNET curated human gene-disease associations database. Table 4 shows the percentage of patterns that are biologically enriched with different biological signatures. Some patterns are enriched with several signatures and some signatures are enriched in the genes of more than one pattern. Table 5 shows the biological signatures that were enriched the most in the reported patterns for S min = 9

Discussion and conclusions
We have proposed a new reverse search algorithm for enumerating all connected induced subgraphs in a single graph. Furthermore, we employed the proposed techniques for mining maximal connected subgraphs that satisfy a constraint defined over the attributes of the vertices. Leveraging on the order in which the subgraphs are enumerated, we proposed two pruning strategies that drastically reduce the running time of the algorithm by pruning search branches that will not result in maximal subgraphs. Experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed approaches. Enrichment analysis of the reported protein-protein subnetworks whose genes are dysregulated in a number of cancers reveals that these subnetworks are biologically significant. Future work includes developing a parallel implementation of the algorithm and designing pruning strategies for real-valued vertex attributes.