Journal of Graph Algorithms and Applications

A Polynomial Delay Algorithm for Generating Connected Induced Subgraphs of a Given Cardinality

We give a polynomial delay algorithm that, for any graph $G$ and positive integer $k$, enumerates all connected induced subgraphs of $G$ of order $k$. Our algorithm enumerates each subgraph in time at most $O((k \min\{n-k, k\Delta\})^2 (\Delta + \log k))$ and uses linear space $O(n+m)$, where $n$ and $m$ are respectively the numbers of vertices and edges of $G$, and $\Delta$ is the maximum degree.


Introduction
Let $G = (V, E)$ be an undirected graph with $|V| = n$ vertices and $|E| = m$ edges. Given an integer $k \in \mathbb{Z}_+$, we consider the problem GEN($G$; $k$) of enumerating (or generating) all subsets $X \subset V$ of vertices such that $|X| = k$ and the subgraph $G[X]$ induced on $X$ is connected. Let us denote the family of such vertex sets by $C(G; k)$.
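As a point of reference for the definition above, the family $C(G; k)$ can be computed by brute force on small graphs. The following is only a minimal sketch, not the algorithm of this note: it tests all $\binom{n}{k}$ subsets, and the names `is_connected` and `brute_force_C`, as well as the adjacency-dictionary representation, are assumptions made for illustration.

```python
from itertools import combinations

def is_connected(adj, X):
    """Return True if the subgraph induced on the vertex set X is connected."""
    X = set(X)
    if not X:
        return False
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in X and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == X

def brute_force_C(adj, k):
    """All k-subsets X of V with G[X] connected, as sorted tuples."""
    return [X for X in combinations(sorted(adj), k) if is_connected(adj, X)]

# Path 0 - 1 - 2 - 3 (hypothetical example graph)
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(brute_force_C(path, 2))  # [(0, 1), (1, 2), (2, 3)]
```

Such a brute-force routine is useful mainly as a correctness oracle when testing an enumeration algorithm on small inputs.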
Typically, the size of $C(G; k)$ is exponentially large in $k$. In fact, it was shown in [Ueh] that there exists a family of connected graphs $G$ with maximum degree $\Delta < \frac{2n}{k}$ for which $|C(G; k)| \ge n\left(\frac{\Delta}{2}\right)^{k-1}$. On the other hand, an upper bound of $\frac{(e\Delta)^k}{(\Delta-1)k}$ on the size of $C(G; k)$ was also given in [Ueh].

An enumeration algorithm for $C(G; k)$ is said to be output polynomial (or total polynomial) if the algorithm outputs all the elements of $C(G; k)$ in time polynomial in $n$ and $|C(G; k)|$. Avis and Fukuda [AF96] introduced the reverse search method for enumeration, and used it to solve, among several other problems, the problem of enumerating all connected induced subgraphs of size at most $k$. Uehara [Ueh] noted that such an algorithm is total polynomial for enumerating $C(G; k)$ when $k$ is a fixed constant. In fact, since the algorithm of [AF96] enumerates the families $C(G; i)$ for all $1 \le i \le k$, the total running time of the algorithm is bounded by an expression, referred to below as (1) (see [Ueh]), which is upper bounded by $O\left(n + m + \frac{(e\Delta)^k}{\Delta-1}\, k^2 \log k\right)$, using the upper bound on $|C(G; k)|$ mentioned above.
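However, we note that such a bound is not polynomial when $k$ is part of the input. In fact, considering the lower bound example mentioned above, we observe that when $k = n - \Delta$, the size of $C(G; k)$ is at most $\binom{n}{k} < n^{\Delta}$, while for all $i < \frac{2n}{\Delta}$, $|C(G; i)| \ge n\left(\frac{\Delta}{2}\right)^{i-1}$. Thus for $i = \frac{2n}{\Delta} - 1$, the total running time in (1) will be at least $n\left(\frac{\Delta}{2}\right)^{\frac{2n}{\Delta}-2}$. Setting, for instance, $\Delta = n^{\epsilon}$ for some $\epsilon \in (0, \frac{1}{2})$, and assuming $n$ is large enough, we get that $i = 2n^{1-\epsilon} - 1 < k = n - n^{\epsilon}$, and thus the running time in (1) is at least $n\left(\frac{n^{\epsilon}}{2}\right)^{2n^{1-\epsilon}-2}$, which is not bounded by any polynomial in $n$ and $|C(G; k)| < n^{n^{\epsilon}}$ because $1 - \epsilon > \epsilon$. Thus, the algorithm suggested in [Ueh] is not a total polynomial algorithm.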
An enumeration algorithm for GEN($G$; $k$) is said to have a polynomial delay of $p(n, m, k)$ if it outputs all the elements of $C(G; k)$ such that the running time between any two successive outputs is at most $p(n, m, k)$. The algorithm is said to use linear space if the total space required by the algorithm (excluding the space for writing the output) is $O(n + m)$. Recently, Karakashian et al. [KCH13] gave an algorithm with delay $O(\Delta^k)$ for solving GEN($G$; $k$).
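In this short note, we give a delay bound that is polynomial in $k$. Theorem 1, as summarized in the abstract, states that GEN($G$; $k$) can be solved with delay $O((k \min\{n-k, k\Delta\})^2 (\Delta + \log k))$ and space $O(n + m)$.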
We remark that the polynomial delay bound can be improved if we do not insist on using polynomial space; see Corollary 1 below.
Our proof of Theorem 1 is also based on the reverse search method [AF96]. In fact, we will consider more generally the supergraph method for enumeration, which can be thought of as a generalization of the reverse search method. This method will be briefly explained in the next section. Then we prove Theorem 1 in Sections 3 and 4.
The problem of enumerating connected induced subgraphs of size $k$ arises in several applications, such as keyword search over RDF graphs in Information Retrieval, and consistency analysis in Constraint Processing; see, e.g., [EB11, KCH13] and the references therein.

The Supergraph Approach
This technique works by building and traversing a directed (super)graph $\mathcal{G} = (C(G; k), \mathcal{E})$, defined on the family $C(G; k)$. The arcs of $\mathcal{G}$ are defined by a neighborhood function $N : C(G; k) \to 2^{C(G;k)}$ that assigns to any $X \in C(G; k)$ a set of its successors $N(X)$ in $\mathcal{G}$. A special node $X_0 \in C(G; k)$ is identified from which all other nodes of $\mathcal{G}$ are reachable. The algorithm works by traversing the nodes of $\mathcal{G}$, in either depth-first or breadth-first search order, starting from $X_0$. If $\mathcal{G}$ is strongly connected, then $X_0$ can be any node in $C(G; k)$.
To avoid confusion, in the following we will distinguish the vertices of the graph $G$ and of the supergraph $\mathcal{G}$ by referring to them as vertices and nodes, respectively.
The following fact is known about this approach (see, e.g., [AF96, BEGM09, JPY88, SS02]):

Proposition 1 Consider the supergraph $\mathcal{G}$ and suppose that (i) $\mathcal{G}$ is strongly connected; (ii) a node $X_0$ in $\mathcal{G}$ can be found in time $t_0(n, m, k)$; (iii) for any node $X$ in $\mathcal{G}$, $|N(X)| \le N(n, m, k)$ and we can generate $N(X)$ with delay $t(n, m, k)$; [...] (iv) there is a function $f$ that assigns to every node $X \ne X_0$, in time $t_1(n, m, k)$, a unique parent $f(X)$ with $X \in N(f(X))$, and $f$ satisfies the following acyclicity property: there exist no $X_{\ell+1} = X_1, X_2, \ldots, X_\ell \in C(G; k)$ such that $X_i = f(X_{i+1})$ for $i = 1, \ldots, \ell$, then $C(G; k)$ can be generated with delay $O(\max\{t_0(n, m, k), (t(n, m, k) + t_1(n, m, k)) \cdot N(n, m, k)\})$ and space $O(n + m)$.
The proof is straightforward. For the first claim, we essentially traverse a breadth-first search (BFS) tree on the supergraph $\mathcal{G}$, starting from node $X_0$. We maintain a balanced binary search tree (BST) on the elements generated, sorting them, say, according to some lexicographic order. We also keep a queue of all elements that have been generated but whose neighborhoods have not yet been explored. When processing a node $X$ in the queue, we output $X$ and then generate all its neighbors in $\mathcal{G}$, but only insert a neighbor $X'$ in the queue if it has not yet been stored in the BST.
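The BFS traversal just described can be sketched as follows. This is an illustration under stated assumptions rather than the implementation analyzed in the text: a Python hash set stands in for the balanced BST, neighbors are obtained by exchanging one vertex, and the function names and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque

def is_connected(adj, X):
    """Return True if the subgraph induced on the vertex set X is connected."""
    X = set(X)
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in X and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == X

def neighbors(adj, X):
    """Yield every connected set obtained from X by exchanging one vertex."""
    for u in sorted(set(adj) - X):
        for v in sorted(X):
            Y = (X - {v}) | {u}
            if is_connected(adj, Y):
                yield Y

def enumerate_bfs(adj, X0):
    """Output each node of the supergraph once, in BFS order from X0."""
    X0 = frozenset(X0)
    seen = {X0}            # hash set used here in place of the balanced BST
    queue = deque([X0])    # generated nodes whose neighborhoods are unexplored
    while queue:
        X = queue.popleft()
        yield X
        for Y in neighbors(adj, X):
            if Y not in seen:
                seen.add(Y)
                queue.append(Y)

# 5-cycle 0 - 1 - 2 - 3 - 4 - 0 (hypothetical example graph), k = 3
cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
for X in enumerate_bfs(cycle5, {0, 1, 2}):
    print(sorted(X))
```

The set `seen` grows with the number of generated nodes, which is why this variant does not achieve the linear space bound.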
To achieve the second claim, we instead traverse a depth-first search (DFS) tree on the supergraph $\mathcal{G}$, starting from node $X_0$. (This is essentially the reverse search method.) When traversing a node $X$, we generate all neighbors of $X$ in $\mathcal{G}$, but only continue the search from a neighbor $X'$ if $X = f(X')$, that is, if $X$ is the unique "parent" of $X'$ defined in (iv). The fact that $f$ is a function satisfying the acyclicity property implies that all the nodes in $\mathcal{G}$ are processed exactly once. In order to obtain the claimed delay bound, we output a node $X$ just after the first visit to it if the depth of $X$ in the tree is odd (assume the root $X_0$ has depth 1) or $X$ is a leaf; otherwise, $X$ is output just before coming back to the parent. Note that we do not need to store the history along any search path, since the parent of any node can be generated efficiently.
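The traversal in this paragraph can be illustrated on a toy supergraph. The sketch below is an assumption-laden illustration, not the paper's implementation: `reverse_search`, `children`, `toy_neighbors` and `toy_parent` are hypothetical names, the toy nodes are the integers 1 to 10 with parent `x // 2`, and the list of children is recomputed at each step instead of being generated one neighbor at a time. Only the current node and its depth are kept between steps.

```python
def children(X, neighbors, parent):
    """Neighbors of X whose unique parent is X, in a fixed order."""
    return [Y for Y in neighbors(X) if parent(Y) == X]

def reverse_search(root, neighbors, parent):
    """Traverse the tree defined by `parent` without storing the search path."""
    X, depth = root, 1
    came_from = None  # child we just returned from, or None on a first visit
    while True:
        kids = children(X, neighbors, parent)
        if came_from is None:
            # First visit: output if the depth is odd or X is a leaf.
            if depth % 2 == 1 or not kids:
                yield X
            nxt = 0
        else:
            nxt = kids.index(came_from) + 1
        if nxt < len(kids):
            X, depth, came_from = kids[nxt], depth + 1, None
        else:
            # All children done: output even-depth internal nodes now.
            if depth % 2 == 0 and kids:
                yield X
            if X == root:
                return
            came_from, X, depth = X, parent(X), depth - 1

# Toy supergraph (hypothetical): nodes 1..10, arcs to x // 2, 2x and 2x + 1.
def toy_neighbors(x):
    return [y for y in (x // 2, 2 * x, 2 * x + 1) if 1 <= y <= 10]

def toy_parent(x):
    return x // 2 if x > 1 else None

print(list(reverse_search(1, toy_neighbors, toy_parent)))
```

Alternating the output position by depth parity is what keeps the gap between consecutive outputs bounded by a constant number of tree moves, even when the search backtracks along a long path.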
Clearly, we may and will assume in the rest of the paper that the graph $G$ is connected. In the next section, we will prove that this assumption implies that the supergraph $\mathcal{G}$ in our case is strongly connected (and in fact has diameter $\delta \le n + k - 2$). Let us now set the other parameters in Proposition 1, corresponding to our problem GEN($G$; $k$). We assume an order on the vertices of $G$, which also defines a lexicographic order "$\preceq$" on the vertex sets. Clearly, the lexicographically smallest node $X_0 \in C(G; k)$ can be found in time $t_0(n, m, k) = O(k\Delta)$ by performing a DFS starting from the smallest vertex in $G$ and processing vertices in order.
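A minimal sketch of this step follows, assuming the vertex order is the DFS order itself (as it is later in the text). The function `smallest_node` is a hypothetical name; it returns the first $k$ vertices reached by a DFS, which form a connected set and are the $k$ smallest vertices with respect to that DFS order. With respect to the original input labels the result need not be lexicographically smallest.

```python
def smallest_node(adj, root, k):
    """First k vertices reached by a DFS from root, scanning neighbors in increasing order."""
    order, seen, stack = [], set(), [root]
    while stack and len(order) < k:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Push larger neighbors first so that the smallest is explored next.
        for w in sorted(adj[v], reverse=True):
            if w not in seen:
                stack.append(w)
    return order

# Hypothetical example graph with edges 0-2, 0-3, 1-3
g = {0: [2, 3], 1: [3], 2: [0], 3: [0, 1]}
print(smallest_node(g, 0, 3))  # [0, 2, 3]
```

The search stops as soon as $k$ vertices have been collected, so only the adjacency lists of at most $k$ vertices are scanned.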
In order to achieve a polynomial space bound, we will need a stronger claim, namely that every node $X \in C(G; k) \setminus \{X_0\}$ is reachable from $X_0$ by a lexicographically monotonically increasing sequence of nodes. This claim will be proved for any DFS ordering on the vertices of $G$, and allows us to identify the parent $X'$ of any node $X$ by defining, for instance, $X' = X \cup \{u\} \setminus \{v\}$, where $u \in V \setminus X$ and $v \in X$ are respectively the smallest and largest vertices such that $X' \prec X$ and the graph $G[X \cup \{u\} \setminus \{v\}]$ is connected. Note that finding $X'$ can be done in time $t_1(n, m, k) \le t(n, m, k) \cdot \min\{n - k, k\Delta\}$.
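One possible reading of this parent rule is sketched below: scan $u \in V \setminus X$ in increasing order and $v \in X$ in decreasing order, and return the first exchange with $u < v$ that keeps the set connected. The tie-breaking order, the names `parent` and `chain_to_root`, and the example graph (whose labels already form a DFS order) are assumptions for illustration.

```python
def is_connected(adj, X):
    """Return True if the subgraph induced on the vertex set X is connected."""
    X = set(X)
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in X and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == X

def parent(adj, X):
    """Lexicographically smaller neighbor of X chosen by the rule above, or None."""
    X = frozenset(X)
    for u in sorted(set(adj) - X):            # smallest u first
        for v in sorted(X, reverse=True):     # largest v first
            if u < v:                         # exchange makes the set smaller
                Y = (X - {v}) | {u}
                if is_connected(adj, Y):
                    return tuple(sorted(Y))
    return None

def chain_to_root(adj, X):
    """Follow parents from X until a node without a parent is reached."""
    chain = [tuple(sorted(X))]
    while True:
        p = parent(adj, chain[-1])
        if p is None:
            return chain
        chain.append(p)

# 4-cycle 0-1-2-3 with a pendant vertex 4 attached to 2; labels are a DFS order.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3, 4], 3: [0, 2], 4: [2]}
print(chain_to_root(adj, (2, 3, 4)))  # [(2, 3, 4), (0, 2, 3), (0, 1, 2)]
```

Because each step replaces a vertex by a smaller one, the chain is strictly decreasing in the lexicographic order and therefore cannot cycle.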

The Neighborhood Operator for C(G; k)
For a set $X \in C(G; k)$, it is natural to define the neighbors of $X$ as those sets which are obtained from $X$ by exchanging one vertex:

$N(X) = \{X' \in C(G; k) : X' = X \cup \{u\} \setminus \{v\} \text{ for some } u \in V \setminus X,\ v \in X\}.$

It is worth comparing our neighborhood definition to the one suggested in [AF96] for generating all connected induced subgraphs of size at most $k$. In the latter definition, two sets $X, X' \subseteq V$ are neighbors if they differ in exactly one vertex. For that definition, the claim of strong connectivity follows immediately from the fact that if $X \subset V$ is such that $G[X]$ is connected, then there is a vertex $u \in X$ such that $G[X \setminus \{u\}]$ is also connected, and similarly, there is a vertex $u \in V \setminus X$ such that $G[X \cup \{u\}]$ is connected. The proof of these two facts follows easily from the simple fact (also used in our proof) that any tree has at least two leaves.
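The exchange neighborhood can be computed directly from its description: try every vertex $u$ outside $X$ against every vertex $v$ inside $X$ and keep the exchanges that leave a connected set. The sketch below is a plain illustration with hypothetical names; it scans all of $V \setminus X$, whereas for $k \ge 2$ it would suffice to scan only vertices adjacent to $X$.

```python
def is_connected(adj, X):
    """Return True if the subgraph induced on the vertex set X is connected."""
    X = set(X)
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in X and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == X

def neighbors(adj, X):
    """All connected sets obtained from X by adding one vertex u and removing one vertex v."""
    X = frozenset(X)
    result = []
    for u in sorted(set(adj) - X):
        for v in sorted(X):
            Y = (X - {v}) | {u}
            if is_connected(adj, Y):
                result.append(tuple(sorted(Y)))
    return result

# Path 0 - 1 - 2 - 3 (hypothetical example graph)
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighbors(path, (1, 2)))  # [(0, 1), (2, 3)]
```

Distinct pairs $(u, v)$ give distinct sets, so the list contains no duplicates.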

Strong Connectivity
We prove first that the supergraph $\mathcal{G}$ is strongly connected.

Lemma 1 The supergraph $\mathcal{G}$ is strongly connected.

Proof. Let $X, Y \in C(G; k)$ be two nodes of $\mathcal{G}$. We construct a sequence of nodes $X_i$, starting from $X$ and ending at $Y$, with $X_{i+1} \in N(X_i)$. Let $d(Z, Z')$ be the (shortest) distance between the two vertex sets $Z, Z'$ in $G$. Suppose we have already constructed $X_i$. We consider two cases.

Case 1. $d(X_i, Y) > 0$ (and hence $X_i \cap Y = \emptyset$). Let $u_0, u_1, \ldots, u_r$ be the ordered sequence of vertices on the shortest path between $X_i$ and $Y$ in $G$, where $u_0 \in X_i$ and $u_r \in Y$. Let $T$ be a spanning tree in $G[X_i]$. Then $T$ has a leaf $v \ne u_0$. We define $X_{i+1} = X_i \cup \{u_1\} \setminus \{v\}$. By construction, $X_{i+1} \in C(G; k)$ and $d(X_{i+1}, Y) < d(X_i, Y)$. Thus, after at most $n - 1$ iterations, we will arrive at case 2.

Case 2. $d(X_i, Y) = 0$. Then there exists a vertex $z \in X_i \cap Y$. Let $C(X_i; z)$ be (the vertex set of) the connected component containing $z$ in $G[X_i \cap Y]$. We claim that there exists a vertex set $X_{i+1} \in N(X_i)$ such that $|C(X_{i+1}; z)| > |C(X_i; z)|$. Indeed, let us contract $C(X_i; z)$ into a single vertex $w$ and denote the new graph by $G'$. Then, by the connectivity of $G[Y]$, there is an edge $\{w, u\}$ in $G'$ with $u \in Y \setminus X_i$. The subgraph of $G'$ induced on $(X_i \setminus C(X_i; z)) \cup \{w\}$ is connected and hence has a spanning tree $T$. Let $v \ne w$ be a leaf in $T$. Then $X_{i+1} = X_i \cup \{u\} \setminus \{v\}$ satisfies the claim. This claim implies that in at most $k - 1$ iterations of case 2, we will have $X_{i+1} = Y$.
In view of Proposition 1, Lemma 1 implies that all the elements of C(G; k) can be enumerated with polynomial delay.
In order to prove that GEN(G; k) can be solved also with polynomial space, we need the following result.
Lemma 2 Consider the lexicographic ordering "$\preceq$" on $C(G; k)$, defined by a DFS order on the vertices of $G$. Let $X$ be any element in $C(G; k)$ that is not lexicographically smallest. Then there is an $X' \in C(G; k)$ such that $X' \in N(X)$ and $X' \prec X$.
Proof. Let us denote the lexicographically smallest element of $C(G; k)$ by $X_0$. Since $X_0 \prec X$, it holds that $w := \min\{u : u \in X_0 \setminus X\} < z := \min\{u : u \in X \setminus X_0\}$. We consider a number of cases.

Case 1. $z$ is not a cut-vertex in $G[X]$ (that is, $G[X] - z$ is connected). We consider two subcases.

Subcase 1.1. The DFS-tree walk (that is, the sequence of vertices visited on the way in DFS order) from $w$ to $z$ enters $X$ at a vertex $y \ne z$ through an edge $\{x, y\}$. Thus $y \in X$, $x \notin X$ and $x < y < z$. Since $z$ is not a cut-vertex, there is a spanning tree in $G[X]$ which has $z$ as a leaf. Then $X' = X \cup \{x\} \setminus \{z\}$ satisfies the claim.

Subcase 1.2. The DFS-tree walk from $w$ to $z$ enters $X$ at $z$ through an edge $\{x, z\}$. If all vertices in $X \setminus \{z\}$ are visited after $z$ in the DFS order, then let $y \ne z$ be a leaf in a spanning tree in $G[X]$, and set $X' = X \cup \{x\} \setminus \{y\}$, which satisfies the claim. Otherwise, there is a vertex $y$ in $X$ that precedes $z$ in the DFS order. Then $y$ necessarily precedes $w$ (otherwise we are in subcase 1.1), and the DFS walk between $y$ and $w$ must exit $X$ through some edge $\{u, v\}$ with $u \in X$ and $v \notin X$ such that $v < w < z$. Then setting $X' = X \cup \{v\} \setminus \{z\}$ satisfies the claim.

Case 2. $z$ is a cut-vertex in $G[X]$. [...] Let $v$ be a leaf in a spanning tree in $G[X_2]$. We consider two subcases.

Subcase 2.1. The DFS-tree walk from $u$ to $v$ does not go through $z$. Then it must be the case that the walk leaves $X_2$, and hence $X$, through an edge $\{x, y\}$, where $x \in X_2$ and $y \notin X$. Since $y < v$, the set $X' = X \cup \{y\} \setminus \{v\}$ satisfies the claim.

Subcase 2.2. The DFS-tree walk from $u$ to $v$ does go through $z$. Then $z < v$ and the DFS walk from $w$ to $z$ enters $X$ at a vertex $x \ne v$ through an edge $\{y, x\}$, where $y \notin X$. Since $y < x \le z < v$, we can set $X' = X \cup \{y\} \setminus \{v\}$ to satisfy the claim.
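The role of the DFS order in Lemma 2 can be checked empirically on a small example. The sketch below is a brute-force check with hypothetical names, not part of the proof. It lists the "stuck" nodes, that is, nodes other than the smallest one that have no lexicographically smaller neighbor. On a path labeled 0 - 3 - 1 - 2, which is not a DFS order, the node {1, 2} is stuck for $k = 2$; after relabeling the vertices by DFS preorder, no node is stuck.

```python
from itertools import combinations

def is_connected(adj, X):
    """Return True if the subgraph induced on the vertex set X is connected."""
    X = set(X)
    start = next(iter(X))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in X and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == X

def dfs_preorder(adj, root):
    """Vertices in the order a DFS from root first visits them."""
    order, seen, stack = [], set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        for w in sorted(adj[v], reverse=True):
            if w not in seen:
                stack.append(w)
    return order

def relabel(adj, order):
    """Rename each vertex by its position in `order`."""
    rank = {v: i for i, v in enumerate(order)}
    return {rank[v]: sorted(rank[w] for w in adj[v]) for v in adj}

def has_smaller_neighbor(adj, X):
    """True if exchanging some v in X for a smaller u outside X keeps X connected."""
    X = set(X)
    return any(u < v and is_connected(adj, (X - {v}) | {u})
               for u in set(adj) - X for v in X)

def stuck_nodes(adj, k):
    """Nodes other than the smallest one that have no smaller neighbor."""
    nodes = [X for X in combinations(sorted(adj), k) if is_connected(adj, X)]
    return [X for X in nodes[1:] if not has_smaller_neighbor(adj, X)]

# Path whose labels along the path are 0 - 3 - 1 - 2 (not a DFS order).
bad = {0: [3], 3: [0, 1], 1: [3, 2], 2: [1]}
good = relabel(bad, dfs_preorder(bad, 0))
print(stuck_nodes(bad, 2))   # [(1, 2)]
print(stuck_nodes(good, 2))  # []
```

A stuck node would have no parent under the rule of Section 2, so the reverse search from $X_0$ would never reach it; this is the situation that the DFS ordering rules out.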