Journal of Graph Algorithms and Applications

The Complexity of the Simultaneous Cluster Problem



Abstract
We study clustering over multiple graphs, each encoding a distinct set of similarity relationships (edges) over the same set of objects (nodes), where the aim is to identify clusters that are supported across the collection of graphs. This problem of simultaneous clustering is readily motivated by the recent deluge of datasets in several domains (including the biological sciences, social sciences, and marketing), where the same objects are repeatedly measured in different conditions, populations or time points. Whilst there has been a vast amount of heuristic work on practical simultaneous clustering problems, little is known on the theoretical side; we present theoretical results that help explain why such heuristics typically come without quantitative guarantees. We give algorithmic and complexity results for simultaneous clustering using two standard measures of clustering quality: density and connectivity. Specifically, we focus on the basic problem of finding a single cluster (rather than an entire clustering) that is simultaneously of high quality in every graph. When the quality of a cluster is its minimum density over all graphs, we show the problem is not approximable within a factor of $2^{\log^{1-\varepsilon} n}$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$. Furthermore, this problem appears very difficult even when there are just two graphs; the resulting problem is approximately as hard as the problem of finding a dense subgraph on at most $k$ vertices. When cluster quality is a fixed connectivity requirement between terminals within the cluster, there are two natural optimization problems: a maximization version (find a good quality cluster with as many terminals as possible) and a minimization version (find a good quality cluster that is as small as possible). We show that the maximization problem is solvable in polynomial time for any fixed connectivity requirement $k$. On the other hand, the minimization problem is hard to approximate within a factor of $2^{\log^{1-\varepsilon} n}$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$. The number of graphs in our reduction depends on $n$. If instead the number of graphs is fixed, we show there is an $\varepsilon > 0$ for which the minimization problem is not approximable within $g^{1/2-\varepsilon}$ for any fixed number $g$ of graphs unless $NP = ZPP$. These hardness results for the minimization problem hold even in the simple cases where the connectivity requirement is one and there are either just two terminal nodes or every node is a terminal node. We remark that our results extend to the case where more robust variants of the quality measure are used.

Introduction
The problem of clustering, that is, partitioning a set of objects into similar groups based upon a graph of similarity relationships defined over the objects, is ubiquitous. Applications abound in data mining, with clustering being a primary choice for exploratory data analysis in various domains such as biology [16], medicine [44], marketing [25], and social network analysis [45]. Our interest in clustering derives from the recent, rapid accumulation of datasets in such domains, where measurements are taken on the same set of objects repeatedly under different experimental conditions, time points, or populations. This yields a collection of graphs defined over the same set of objects (nodes) but with different sets of relations (edges) amongst them. This, in turn, calls for a new paradigm of clustering that jointly analyses multiple graphs to identify common signals and conserved clusters. This paradigm is very relevant in the biological sciences, for instance, where the replication of a discovery (for example, functional similarity of a set of genes) is often sought across multiple, independent datasets to minimize spurious findings caused by noise/artifacts in individual datasets and to exploit the complementarity of the datasets [30,21,39]. With advances in high-throughput instruments, there is a deluge of molecular data on the same biological system generated using different experimental backgrounds, perturbation techniques and technological platforms.
Each dataset comes with its own set of biases and artifacts due to these differences, and calls for methods that integrate diverse datasets more carefully than simply concatenating or combining them into one dataset or similarity graph prior to clustering. Machine-learning methods could be used to carefully integrate multiple datasets into one similarity function, but they typically rely heavily on domain knowledge in the form of training data and model assumptions [22]. We are interested in a problem abstraction that naturally extends single-graph clustering to multiple graphs and is suitable for the exploratory or "unsupervised" setting where there is no training data.
Our goal, therefore, is to obtain a clustering that is good over a collection of graphs, $\mathcal{G} = \{G_1, G_2, \ldots, G_t\}$, that share the same set of nodes. We dub this problem simultaneous clustering. Of course, in order to assess whether a clustering is good we must specify a measure of quality. In this paper we use perhaps the two most natural and widely-studied attributes associated with a cluster, namely density and connectivity. Thus, a clustering will be good if it induces dense or highly connected clusters in each of the graphs $G_i$, even though the actual edge sets induced may vary widely between the graphs. In Section 2 we will see how these two measures arise in biological studies aimed at discovering sets of functionally coherent genes and complexes/scaffolds of interacting proteins. First, though, we formalise the problem and state our results.

Our Results.
We are given a collection of graphs $\mathcal{G} = \{G_1, G_2, \ldots, G_t\}$, where $G_i = (V, E_i)$ for each $1 \le i \le t$, and a quality measure. A clustering is a partition of $V$ into subsets $S_1, S_2, \ldots, S_\ell$; each $S_i$ is called a cluster. We restrict our attention to the fundamental problem of finding a single cluster $S \subseteq V$ that is good, that is, has at least a specified quality $q^*$ in the subgraph $G_i[S]$ it induces in each graph $G_i$. We call this the simultaneous cluster problem and show that it is solvable in polynomial time in a few cases but is typically very hard.

Simultaneous Cluster Problem.
Input: Graphs $G_i = (V, E_i)$, where $1 \le i \le t$, and a quality threshold $q^*$.
Objective: A cluster $S \subseteq V$ such that the quality of $G_i[S]$ is at least $q^*$ for all $i$.
As stated, the two quality measures we will consider are density and (terminal) connectivity.
• We define the density of a cluster $S$ in a collection of graphs to be $\mathrm{den}(S; \mathcal{G}) = \min_{1 \le i \le t} \frac{|E_i(S)|}{|S|}$, where $E_i(S)$ is the set of edges in the graph $G_i[S]$ induced by the vertex set $S$.
• Given a set of terminals $T \subseteq V$, we define the (terminal) connectivity of a cluster $S$ in a collection of graphs to be $\kappa(S; \mathcal{G}) = \min_{1 \le i \le t} \kappa_i(S)$, where $\kappa_i(S)$ is the minimum pairwise connectivity between terminals $T \cap S$ in $G_i[S]$.

For the density measure, our first result shows that there is a major difference in hardness when we move from a single graph to just two graphs. Specifically, the densest subgraph problem is polynomial time solvable with one graph (see Chapter 4 of [29], [34] and [19]), but for two graphs we prove that the densest simultaneous cluster problem is approximately as hard as Densest k-Subgraph (Theorem 1 and Lemma 1). Here Densest k-Subgraph refers to the problem of finding the densest subgraph on at most $k$ vertices given an input graph $G$ and a number $k$. This problem can be approximated to within a factor of $O(n^{1/4+\varepsilon})$, due to a recent breakthrough result of [9]. Our result is of interest because it is widely believed [10,3,17,18] that the hardness of Densest k-Subgraph is also close to this upper bound; indeed, Bhaskara et al. [10] present $O(n^{\Omega(1)})$ lower bounds for lift and project methods based upon the Sherali-Adams and the Lasserre hierarchies. If so, whilst a size restriction is clearly vital with regards to complexity in the case of a single graph, it is redundant in the case of two graphs: there the problem is very hard even when no size restrictions are given.
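A minimal sketch of how the density measure can be evaluated on a candidate cluster. It assumes each graph is given as a list of undirected edges over shared node identifiers, and that the density of a cluster in one graph is the number of induced edges divided by the number of nodes in the cluster; the function names and the toy graphs are illustrative only.

```python
from fractions import Fraction


def induced_edges(edges, cluster):
    """Edges of one graph with both endpoints inside the cluster."""
    cluster = set(cluster)
    return [(u, v) for (u, v) in edges if u in cluster and v in cluster]


def simultaneous_density(edge_sets, cluster):
    """Minimum over all graphs of |E_i(S)| / |S| (assumed density convention)."""
    cluster = set(cluster)
    if not cluster:
        raise ValueError("cluster must be non-empty")
    return min(
        Fraction(len(induced_edges(edges, cluster)), len(cluster))
        for edges in edge_sets
    )


# Two toy graphs on the nodes 0..3 (illustrative data only).
g1 = [(0, 1), (1, 2), (0, 2), (2, 3)]
g2 = [(0, 1), (1, 2), (2, 3)]
triangle_density = simultaneous_density([g1, g2], {0, 1, 2})
```

In this toy instance the cluster {0, 1, 2} is a triangle in the first graph but only a path in the second, so its simultaneous density is governed by the second graph.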
To complement this result, we show that the problem does have large inapproximability bounds when the number of graphs gets large.

Theorem 3. Densest Simultaneous Subgraph is not approximable within $2^{\log^{1-\varepsilon} n}$ for any $\varepsilon > 0$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.
In fact, this hardness result also applies to the problem of finding a minimum cardinality subset that has non-zero density in each graph, i.e. $\mathrm{den}(S; \mathcal{G}) > 0$. That is, the simple problem of finding the smallest cluster that induces at least one edge in many graphs is very hard to approximate. So if in an application the functionality (quality) of a cluster $S$ is defined to simply depend upon whether or not at least two nodes in that cluster can interact then, from an approximation viewpoint, we are already in trouble! This helps explain why heuristics for many clustering problems with more complex quality measures, e.g. in bioinformatics, typically come without quantitative guarantees.
For the terminal connectivity measure, we fix the desired connectivity $k$ for determining whether a cluster is good and study two natural optimization criteria. We first present good news for finding a good cluster with as many terminals as possible.
Theorem 4. For a fixed connectivity requirement k, there is a polynomial time algorithm for Maximum Simultaneous k-Connected Steiner Cluster.
As connectivity is a monotonic property with regards to the addition of non-terminal nodes, this maximization criterion could produce large clusters that contain extraneous nodes in some scenarios. So we also study the problem of finding a good cluster with as few nodes as possible. We show this is hard to approximate even in the extreme cases of just two terminals $\{s, t\}$ or all nodes being terminals, even when the connectivity requirement is $k = 1$.
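For the connectivity requirement k = 1, checking whether a given cluster is good amounts to a reachability test in every graph. The sketch below assumes edge-list graphs over shared node identifiers and treats a cluster containing at most one terminal as trivially good; both the helper names and that convention are assumptions made for illustration.

```python
from collections import deque


def terminals_connected(edges, cluster, terminals):
    """True if all terminals inside the cluster share one component of G[cluster]."""
    cluster = set(cluster)
    inside = [t for t in terminals if t in cluster]
    if len(inside) <= 1:
        return True  # assumed convention: nothing to connect
    adjacency = {v: set() for v in cluster}
    for u, v in edges:
        if u in cluster and v in cluster:
            adjacency[u].add(v)
            adjacency[v].add(u)
    # Breadth-first search from one terminal inside the induced subgraph.
    seen = {inside[0]}
    queue = deque([inside[0]])
    while queue:
        x = queue.popleft()
        for y in adjacency[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return all(t in seen for t in inside)


def is_good_cluster(edge_sets, cluster, terminals):
    """The cluster must connect its terminals in every graph simultaneously."""
    return all(terminals_connected(edges, cluster, terminals) for edges in edge_sets)


# Toy instance: the two graphs route 0 to 2 through different middle nodes.
path_via_1 = [(0, 1), (1, 2)]
path_via_3 = [(0, 3), (3, 2)]
```

Here the terminals 0 and 2 are joined through node 1 in one graph and through node 3 in the other, so a good cluster must contain both intermediate nodes. This is the kind of interaction between graphs that makes the minimization version delicate.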
In fact, we obtain inapproximability results that scale with the number of input graphs.
Theorem 7. Simultaneous s-t Path is not $g^{1/2-\varepsilon}$-approximable for some $\varepsilon > 0$, where $g$ is the number of graphs, unless $NP = ZPP$.

Theorem 8. Minimum Simultaneous Connected Steiner Cluster is not $g^{1/2-\varepsilon}$-approximable for some $\varepsilon > 0$, where $g$ is the number of graphs, unless $NP = ZPP$.
These hardness results for clustering many graphs also extend to robust variants of the problems where the optimal solution is only required to satisfy the quality (density or connectivity) constraint in a $c$ fraction of the $g$ input graphs. This follows readily as an algorithm for the robust variant ($c < 1$) can be used to solve an instance of the exact variant ($c = 1$) by adding $(g/c - g)$ empty graphs. For example, this would mean that maximizing the median density ($c = \frac{1}{2}$) of a subgraph in the input graphs is at least as hard as maximizing the minimum density of a subgraph in the original graphs. When clustering many graphs, if we let the optimal solution satisfy the quality constraint in all graphs as in the original problem definitions, but relax the approximation algorithm to return a solution that satisfies the constraint in only a $c$ fraction of the input graphs, this $c$-relaxed approximate solution is still hard to find.

Theorem 9. Densest Simultaneous Subgraph is not $c$-relaxed approximable within $2^{\log^{1-\varepsilon} n}$ for any $\varepsilon > 0$ and constant $c > \frac{2}{3}$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.

Theorem 10. Simultaneous s-t Path is not $c$-relaxed approximable within $2^{\log^{1-\varepsilon} n}$ for any $\varepsilon > 0$ and constant $c > \frac{1}{2}$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.

Theorem 11. Minimum Simultaneous Connected Steiner Cluster is not $c$-relaxed approximable within $2^{\log^{1-\varepsilon} n}$ for any $\varepsilon > 0$ and constant $c > \frac{4}{5}$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.
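The padding argument for the robust variants (adding $g/c - g$ empty graphs) can be sketched as follows. The sketch assumes edge-list graphs, rounds the number of added graphs down when $g/c$ is not an integer (a choice made here, not taken from the text), and reads "a $c$ fraction" as at least $\lceil c \cdot g' \rceil$ of the $g'$ graphs in the padded collection.

```python
from fractions import Fraction
from math import ceil, floor


def pad_with_empty_graphs(edge_sets, c):
    """Append floor(g/c) - g edgeless graphs (rounding down is an assumption)."""
    c = Fraction(c)
    g = len(edge_sets)
    extra = floor(g / c) - g
    return list(edge_sets) + [[] for _ in range(extra)]


def graphs_required(num_graphs, c):
    """Smallest number of graphs that makes up a c fraction of the collection."""
    return ceil(Fraction(c) * num_graphs)


# Three original graphs and robustness parameter c = 2/3 (toy data).
originals = [[(0, 1)], [(1, 2)], [(0, 2)]]
padded = pad_with_empty_graphs(originals, Fraction(2, 3))
needed = graphs_required(len(padded), Fraction(2, 3))
```

An edgeless graph can never give positive density, nor connect two terminals. So once the number of graphs that must be satisfied equals the number of original graphs, a solution to the robust variant on the padded collection has to satisfy every original graph.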
We prove our results for the density and connectivity measures in Sections 4 and 5, respectively. Before doing so, in Section 2, we describe in detail how the problem of simultaneous clustering arises naturally in bioinformatics, and discuss the techniques and heuristics currently used for such problems. We then compare, in Section 3, our problem to previous work in stochastic optimization where there are multiple inputs (or scenarios).

Simultaneous Clustering in Bioinformatics
The major motivation underlying this work is the abundance in bioinformatics of simultaneous clustering problems based upon connectivity and, especially, density quality measures. So, in this section we give a detailed and slightly technical overview of why such problems arise and give a guide to some of the research that has been carried out in this area. This provides context for our research but a reader solely interested in the theoretical aspects of the underlying combinatorial problem may choose to proceed to the next section.
Interactions between genes, proteins and other molecules form the basis of most cellular processes, and large-scale measurements of such interactions are now routine in the life sciences [23]. For instance, it is possible to monitor the activity or expression patterns of thousands of genes in an organism across many replicates, and currently more than 22,000 such expression datasets from different studies are available in a public resource called GEO [8]. An expression dataset can be used to build a coexpression network, whose nodes are monitored genes and whose edges are gene pairs with similar activity patterns. If the activity patterns are measured in a sufficient number of systematically perturbed replicates, the edges in a coexpression network correspond to functionally related gene pairs. This idea is central to a large number of bioinformatic studies that discover new (or characterize known) biological processes by systematically identifying densely connected clusters in the coexpression network [48,16]. A similar approach is widely used to identify connected scaffolds or dense complexes of physically interacting proteins from a genome-wide network of protein-protein interactions [36].
The joint analysis of multiple biological graphs is becoming increasingly important for two major reasons. The first reason is statistical: each dataset is a noisy measurement of the true functional relation of genes, hence discoveries (functionally coherent genes/protein clusters) supported by independent coexpression or protein interaction networks are more robust against artifacts in individual datasets [39,30]. The other reason is biological: interesting insights into the evolution and regulation of biological systems are sometimes possible only by integrating diverse datasets obtained from different species, cell types or conditions [37,27].
Several techniques and heuristics are employed to address the related problems above. A common strategy is to frame the problem of finding protein complexes in a single protein network [36] or finding evolutionarily conserved complexes in multi-species networks [37,27] as locating heavy subgraphs in a single weighted "alignment graph". The node and edge weights of this alignment graph aggregate the features of each input network using a biologically motivated scoring scheme or Bayesian model. A node in the alignment graph, for instance, could represent a gene in the input networks for genes exhibiting a one-to-one evolutionary relationship in multiple species, and a gene family for genes in one species that are related to multiple genes in other species. A heuristic that starts with seed nodes and greedily adds or removes nodes to these seeds is then used to optimize the score of the induced subgraph of the alignment graph. When certain criteria based on the connectivity and monotonic local similarity between proteins in different species were used to define evolutionary conservation, a provably efficient algorithm based on a recursive approach was possible for finding conserved protein complexes [31].
The problem of finding connected subnetworks in one network (protein network) that are dense or high-scoring in another network (coexpression network) has been addressed using greedy heuristics too [43]. Spectral techniques found use in a related problem of finding a clustering that maximizes the connectedness of each cluster and minimizes the weight of edges lost between the clusters in all input biological networks [32]. Different notions of terminal connectivity were explored to find protein interactions that optimally explain the differential activity of a set of genes and thereby expand our current knowledge of proteins/genes involved in certain biological processes [46]. Algorithms for finding k-cliques (for small $k$) have been used as subroutines to uncover the structure and evolution of overlapping clusters in biological and social networks [33,49]. Recently, a study used simulated annealing to detect disease-specific genes that clustered in hundreds of coexpression networks [30]. So it is not exactly a Steiner problem, but there are some similarities.
Clearly, the exact models, heuristics and algorithms used in the multi-graph methods above are driven mainly by biological considerations. As stated, our aim in this paper is to provide a computational treatment of the underlying simultaneous clustering problem. In particular, whilst we show that good algorithms are possible with some quality measures, our main contribution is to give an explanation for why quantitative guarantees have been elusive in previous works.

Related Work
Our work bears some relation to the field of stochastic optimization, which encompasses optimization problems that are robust to uncertainty in the input data. The uncertainty is modeled by a probability distribution over possible realizations (scenarios) of the input data, and the objective function involves minimizing the expected cost (or maximizing the expected profit) of the algorithm [12]. The framework also includes other robustness measures such as minimizing the maximum cost across all (or a large fraction of) scenarios [42,41] or permitting the cost in each scenario to be worse by a factor of $p$ than the optimal cost in that scenario [2,38,1].
Given this generic definition, the simultaneous clustering problem could be considered as a stochastic optimization problem where the graphs with different edge weights are the different scenarios and we seek a set of common clusters that are robust (of good quality) in all input scenarios (graphs). Existing works on approximation algorithms or complexity results of stochastic optimization problems focus either on problems not closely related to clustering, such as covering problems or finance-related problems, or on facility location problems that differ in several ways from the clustering model considered in this work.
For the simultaneous clustering problem, our objective is to minimize the maximum cost across all scenarios (the so-called min-max objective). Complexity results have been obtained for non-clustering problems with this objective. Strong NP-hardness is known for the shortest path problem [47], the assignment problem (bipartite matching) and the knapsack problem [26]. Set cover with min-max objective is known to be hard to approximate (as hard as Densest k-Subgraph) [4]. These results are for the cases where the number of scenarios is also given as input. Weak NP-hardness results are also known when the number of scenarios is fixed [2]. Our inapproximability results for simultaneous clustering with the density measure apply when there are only two scenarios, also reducing from Densest k-Subgraph (our reduction differs from the one given for set cover). To our knowledge, the closest work to ours is for the min-max version of the k-centre problem [11]. There the problem is studied with different scenarios in order, for example, to account for the congestion effects of rush hours. They gave a simple but elegant 3-approximation algorithm for the case of two scenarios but showed the problem is inapproximable for three scenarios. As well as the quality measure, their work differs from ours in one important aspect. Whilst the single time-interval version of the k-centre problem can be viewed as a clustering (around centres) problem on one graph, the min-max variant is not a clustering problem because nodes can be serviced by different centres in different scenarios. Indeed, it is easy to show that the simultaneous clustering version of the k-centre problem has a factor 2-approximation for any number of graphs, as it reduces to the single graph case. There is also a rich body of work on other stochastic uncapacitated facility location (SUFL) problems where the objective is to find an optimal set of facilities to robustly serve a set of clients. The uncertainty could be in the demands of the clients (e.g., which clients need service), the client locations and hence their distances to the facilities, or other input parameters, and is modeled using single/multiple stage stochastic models [40,4,38]. These problems typically differ from ours in many respects: in the choice of measure and objective function, in that they cease to be clustering problems in the multiple scenario case, and in that they only use a single distance metric between the clients across all scenarios (e.g. [4]).

The Density Measure
To begin our study into the simultaneous cluster problem, we consider the density measure.
(Here $\mathrm{den}_i$ is the density of the graph induced by $S$ in $G_i$.) For the "non-simultaneous" case of a single graph, that is $t = 1$, Densest Simultaneous Subgraph is equivalent to the densest subgraph problem and so is solvable in polynomial time [29,34,19]. For the simultaneous case, in this section, we consider the complexity of the cases $t = 2$ and $t$ large. We relate the two-graph problem to the single-graph problem in which the solution is restricted to have at most $k$ vertices, a problem widely believed to be difficult to approximate; the reduction takes an instance of that single-graph problem and builds a two-graph instance. For the case where $t$ is large we give a reduction from LabelCover-Max and, consequently, show this problem is inapproximable within $2^{\log^{1-\varepsilon} n}$ for any $\varepsilon > 0$, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.

Clustering Two Graphs
We now consider the simultaneous cluster problem with exactly two graphs under the density measure. As noted, finding the densest subgraph in a single graph is easy. This is certainly not the case with two graphs. Specifically, here we show that finding a vertex set that simultaneously induces dense subgraphs in two graphs is approximately as hard as finding a densest subgraph on at most $k$ vertices in a single graph:

Densest k-Subgraph.
Input. A graph $G = (V, E)$ and a number $k$.
Objective. An induced subgraph $H^*$ of maximum density containing at most $k$ vertices.
To obtain this hardness result, we begin by showing how a polynomial time algorithm for Densest Simultaneous Subgraph in two graphs would lead to a polynomial time algorithm for Densest k-Subgraph. Then, we adapt those techniques to show how inapproximability bounds (whatever they may be!) are also roughly maintained between these two problems.
Theorem 1. If we can solve Densest Simultaneous Subgraph on two graphs in polynomial time then we can solve Densest k-Subgraph in polynomial time.
Proof. Note that, for a fixed $n$, there are at most $n^3$ possible different density values. Therefore, by trying each candidate value in turn, we can assume that the optimal density $d$ is fixed; that is, we know $d$ whenever needed. Now, given an instance $(G, k)$ of Densest k-Subgraph we reduce it to an instance of Densest Simultaneous Subgraph on two graphs, $G_1$ and $G_2$. We actually build $G_1$ and $G_2$ out of two graphs, $G^1$ and $G^2$, on disjoint vertex sets by taking their disjoint union. So edges in $G_1$ have both endpoints in $G^1$ and edges in $G_2$ have both endpoints in $G^2$. We use the notation $n_i = |V(G^i)|$ and take $G^1 = G$. Obtaining $G^2$ is a little more complex. We desire $G^2$ to have the following two properties:
(I) It is a minimum cardinality graph with exactly $dk$ edges.
(II) All of its proper subgraphs are strictly less dense.
Observe that if $G^2$ satisfies Property (I) then it must have density $d_2 = \frac{dk}{n_2}$. Furthermore, since $G^2$ contains as few vertices as possible, $\binom{n_2-1}{2} < dk \le \binom{n_2}{2}$ and thus, dividing by $n_2/2$, we obtain $\frac{(n_2-1)(n_2-2)}{n_2} < 2d_2 \le n_2 - 1$. Now $G^2$ contains $r \ge 0$ edges less than the complete graph on $n_2$ vertices, $K_{n_2}$. It must be the case that $r \le n_2 - 2$, otherwise the clique $K_{n_2-1}$ has at least as many edges as $G^2$. So, we can construct $G^2$ by removing $r$ edges from $K_{n_2}$. We need to choose these edges judiciously, in order for Property (II) to hold. Towards this goal let $P = \{e_1, e_2, \ldots, e_{n_2-1}\}$ form a Hamiltonian path in $K_{n_2}$. Let $M_1$ consist of the odd indexed edges in $P$, and let $M_2$ be the even indexed edges. Then to build $G^2$ we remove the $r$ edges by first deleting edges of $M_1$ and then deleting edges of $M_2$ in reverse order.
Suppose that we are required to remove edges from $M_2$, that is, $r > \frac{1}{2} n_2$. Then the maximum degree, $\Delta(G^2)$, is $n_2 - 2$ and the minimum degree, $\delta(G^2)$, is $n_2 - 3$. If not, the maximum and minimum degrees are bounded by $n_2 - 1$ and $n_2 - 2$, respectively. We may now show that $G^2$ does satisfy Property (II).
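The construction of the second building-block graph can be sketched in a few lines. The sketch takes the required number of edges ($dk$ in the text) as an integer, labels the vertices $0, \ldots, n_2 - 1$ with the Hamiltonian path running through them in order, and reads "in reverse order" as deleting the highest-indexed even edge first; these labelling choices and the function name `build_g2` are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def build_g2(num_edges):
    """Fewest-vertex graph with exactly num_edges edges, following the text."""
    if num_edges < 1:
        raise ValueError("need at least one edge")
    # Smallest n2 such that the complete graph K_{n2} has enough edges.
    n2 = 2
    while comb(n2, 2) < num_edges:
        n2 += 1
    r = comb(n2, 2) - num_edges  # number of edges to delete from K_{n2}
    # Hamiltonian path e_1, ..., e_{n2-1} through the vertices 0, 1, ..., n2-1.
    path = [(i, i + 1) for i in range(n2 - 1)]
    m1 = path[0::2]  # odd-indexed edges e_1, e_3, ...
    m2 = path[1::2]  # even-indexed edges e_2, e_4, ...
    removal_order = m1 + m2[::-1]  # all of M_1 first, then M_2 backwards
    removed = set(removal_order[:r])
    edges = [e for e in combinations(range(n2), 2) if e not in removed]
    return n2, edges


n2_example, edges_example = build_g2(12)
```

Because the deleted edges lie on a path, every vertex loses at most two incident edges, which is what keeps the degrees within the bounds stated above.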
Proof. To show every proper subgraph $H$ of $G^2$ has lower density, we consider three cases. Two cases are simple. If $H$ has $n_2$ vertices then, as it is a proper subgraph of $G^2$, it has fewer edges so is less dense. If $H$ has at most $n_2 - 2$ vertices then the maximum degree $\Delta(H)$ is at most $n_2 - 3$. Consequently, the average degree in $H$ is at most $n_2 - 3$. However, $G^2$ has average degree strictly greater than $n_2 - 3$, as, by construction, it always has a vertex of degree at least $n_2 - 2$. So $H$ is less dense. So consider the case where $H$ has $n_2 - 1$ vertices. Then […] and thus […]. The last inequality holds provided […].

We may now complete the description of our Densest Simultaneous Subgraph instance $(G_1, G_2)$. Given $G^1$ and $G^2$, as above, set $D^* = \min\left\{\frac{dk}{n_2+k}, \frac{n_2 d_2}{n_2+k}\right\}$. Note that since $dk = n_2 d_2$, the two terms inside the min are the same. Since we assumed we know the optimal density $d$ in $G$, the optimal solution to our instance of Densest Simultaneous Subgraph has value at least $D^*$. (Algorithmically, if the optimum is less than $D^*$, we can stop our search for this value of $d$ and claim that the optimal density for $G$ is lower than $d$.) It remains to show that an optimal solution $H_1 \cup H_2$ of value at least $D^*$, where […]. We now have several cases to consider.
[…]$/(n_2+k) = D^*$ and equality holds only when $|V(H_1)| = k$. In that case, we can return $H_1$ as it then has size $k$ and density […]

JGAA, 18(1) 1-34 (2014)
Lemma 1. If we can approximate Densest Simultaneous Subgraph on two graphs within a factor of $\alpha$ then we can approximate Densest k-Subgraph within a factor $2\alpha$ with a solution of size at most $(2\alpha - 1)k$.
Proof. Again, we assume the optimal density $d$ is known when needed. Given an instance $(G, k)$ of Densest k-Subgraph we reduce it to an instance $(G_1, G_2)$ of Densest Simultaneous Subgraph as before. Again, if there is a subgraph $H$ in $G = G^1$ of cardinality $k$ and density $d$ then the value of the solution $H \cup V(G^2)$ in our instance of Densest 2-Simultaneous Subgraph is $\min\left\{\frac{dk}{n_2+k}, \frac{n_2 d_2}{n_2+k}\right\}$. Note that since $dk = n_2 d_2$, the two terms inside the min are the same. Moreover, as $n_2$ is the cardinality of the smallest graph with $dk$ edges, it must be the case that $k \ge n_2$. So $k = \beta n_2$ for some $\beta \ge 1$. Now take a solution $H_1 \cup H_2$ output by the approximation algorithm, where […]. We will show that $H_1$ is an approximate solution to the instance of Densest k-Subgraph. Again, assume that […]. We now have several cases to consider.
Thus we obtain a contradiction.
[By algebra, as […]] This is a contradiction. So $H_1$ is at most a factor $2\alpha - 1$ larger than $H$.
Again, the last inequality arises as $\beta \ge 1$. Hence […]. So $H_1$ contains at most $(2\alpha - 1)k$ vertices and has a density within a factor $2\alpha$ of the densest subgraph on $k$ vertices in $G^1$, so we can return $H_1$ as an approximate solution.
Lemma 2. Let $G$ be a graph on $k_1$ vertices with density $d_1$. Then for any $k_2 = \gamma k_1$, there exists a subgraph of $G$ on $k_2$ vertices with density $\gamma d_1$.
Proof. Randomly choose a subset $V_2$ of size $k_2$ with each subset equally likely.
Then each edge appears with probability […] in the subgraph $H$ induced by $V_2$. Since the total number of edges in $G$ is $d_1 k_1$, it follows that the expected number of edges in $H$ is […]
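The averaging argument can be checked numerically on a small graph. When a $k_2$-subset of the $k_1$ vertices is chosen uniformly at random, a fixed edge has both endpoints selected with probability $\frac{k_2(k_2-1)}{k_1(k_1-1)}$, so by linearity of expectation the expected number of induced edges is the total edge count times this ratio. That formula is a standard fact and is not quoted from the text. The sketch below compares it with a brute-force average; the helper names and the toy graph are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def expected_induced_edges(vertices, edges, k2):
    """Exact average number of induced edges over all k2-subsets of vertices."""
    subsets = list(combinations(vertices, k2))
    total = 0
    for subset in subsets:
        chosen = set(subset)
        total += sum(1 for u, v in edges if u in chosen and v in chosen)
    return Fraction(total, len(subsets))


def predicted_expectation(k1, num_edges, k2):
    """num_edges * k2 * (k2 - 1) / (k1 * (k1 - 1)), by linearity of expectation."""
    return Fraction(num_edges * k2 * (k2 - 1), k1 * (k1 - 1))


# A 6-cycle with one chord (toy data).
toy_vertices = range(6)
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
average_on_three = expected_induced_edges(toy_vertices, toy_edges, 3)
```

Since some subset must induce at least the average number of edges, a subgraph on $k_2$ vertices with at least that many edges always exists.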

Clustering Many Graphs
Our hardness result for two graphs is compelling but, given the current state of knowledge, it still remains possible that there are constant factor approximation algorithms for Densest Simultaneous Subgraph in two graphs. For the case of many graphs, however, we are able to obtain much stronger inapproximability results. Specifically, we give a reduction from LabelCover; this is one of the six canonical inapproximable problems described by Arora and Lund [5]. We will need its maximization version. The following gap-preserving reduction for LabelCover-Max is known, and follows from the PCP Theorem [6, 7] and Raz's Parallel Repetition Theorem [35].

LabelCover-Max Problem:
Theorem 12. For any $\varepsilon > 0$, an instance of Sat can be transformed in quasipolynomial time into a $d$-regular instance of LabelCover-Max such that
• if the original instance of Sat is satisfiable then the instance of LabelCover-Max has a solution of value 1,
• if the original instance of Sat is not satisfiable then all solutions to the instance of LabelCover-Max have value at most $2^{-\log^{1-\varepsilon} n}$.
[The value of a solution is the ratio of edges covered compared to $|E|$, the number of edges.] Consequently, the inapproximability bounds for LabelCover-Max are very large.
We show that an approximation algorithm for Densest Simultaneous Subgraph leads to an approximation algorithm for LabelCover-Max with the following guarantees.
Proof. Take an instance (G, N, Π) of LabelCover-Max. We build an instance of Densest Simultaneous Subgraph on a collection of graphs H as follows. There is one graph $H_e \in H$ for each edge e of G. All graphs share the same vertex set: there is a vertex (u, i) for each pair u ∈ V(G), i ∈ [N]. The edge sets of the graphs, however, are disjoint. For an edge e = (u, v) of G, there is an edge in $H_e$ between (u, i) and (v, j) if and only if $\Pi_{(u,v)}(i) = j$. Thus, if |A| = q = |B| then H contains qd graphs, and each such graph $H_e$ is bipartite with 2qN vertices.
We now add an extra graph and extra vertices so that, later in the proof, solutions of size s are guaranteed to have density at most 1/s. We add two isolated vertices û, v̂ to the vertex set (of each graph in H) and add a new graph Ĥ containing only the single edge (û, v̂).
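The construction above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `build_dss_instance`, `min_density`, `U_HAT` and `V_HAT` are hypothetical, the projections Π are encoded as dictionaries, labels are numbered from 0 rather than 1, and the density of a cluster S in a graph is assumed to be the number of induced edges divided by |S| (which matches the remark that a single-edge graph forces density at most 1/s).

```python
# Hypothetical sentinel names for the two extra vertices (u-hat and v-hat).
U_HAT, V_HAT = ("u_hat", 0), ("v_hat", 0)

def build_dss_instance(vertices, edges, num_labels, proj):
    """Build the graph collection of the reduction.

    vertices:   vertices of the LabelCover graph G
    edges:      list of pairs (u, v), one per edge of G
    num_labels: N, the number of labels (labels are 0 .. N-1 here)
    proj:       proj[(u, v)][i] = j encodes Pi_(u,v)(i) = j
    Returns (vertex_set, graphs); each graph is a set of 2-element frozensets.
    """
    vertex_set = {(x, i) for x in vertices for i in range(num_labels)}
    vertex_set |= {U_HAT, V_HAT}
    graphs = []
    for (u, v) in edges:
        # One graph H_e per edge e = (u, v) of G.
        graphs.append({frozenset({(u, i), (v, j)})
                       for i, j in proj[(u, v)].items()})
    # The extra graph containing only the edge between the two new vertices.
    graphs.append({frozenset({U_HAT, V_HAT})})
    return vertex_set, graphs

def min_density(cluster, graphs):
    """Minimum over all graphs of (induced edges) / |cluster|."""
    cluster = set(cluster)
    if not cluster:
        return 0.0
    fewest = min(sum(1 for e in g if e <= cluster) for g in graphs)
    return fewest / len(cluster)
```

For a LabelCover instance with a single edge (a, b), a cluster containing one endpoint copy on each side together with both extra vertices induces one edge in every graph, so its minimum density is 1/4; dropping the extra vertices gives density zero in the last graph.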

Note that we may partition the vertices of
Clearly any optimal solution $S^*$ to the instance of Densest Simultaneous Subgraph must use at least one vertex from each of these sets. Otherwise there is at least one (in fact, at least d) graph $H_e$ within which no edges are induced and, thus, the minimum density is zero. Furthermore, û and v̂ are both in $S^*$, or else the density of $S^*$ in Ĥ is zero. So the optimal solution $S^*$ has cardinality at least 2q + 2.
Observe that if an edge ((u, i), (v, j)) is induced by $S^*$ in $H_{u,v}$ then the corresponding edge in LabelCover-Max is covered, provided we set ℓ(u) = i and ℓ(v) = j. For our hardness result, we may assume that all the edges in the LabelCover-Max instance can be covered. Thus, we may assume that the solution $S^*$ induces a density $D^*$ of at least $\frac{1}{2q+2}$ in each graph. By our hypothesis, we can approximate $D^*$ to within an α factor. Thus we obtain a solution S with density at least $\frac{1}{2\alpha q+2\alpha}$. By the construction of Ĥ, S has size at most 2αq + 2α < 3αq. We now use S to build a solution to the instance of LabelCover-Max.
Let $X = \{v \in G : |W_v \cap S| > 6\alpha\}$. Now $|X| < \frac{1}{2}q$; otherwise, $|S| > \frac{1}{2}q \cdot 6\alpha = 3\alpha q$. Furthermore, as G is d-regular, the vertices in X cover at most half of the dq edges of G; thus the vertices in $\bar{X} = (A \cup B) \setminus X$ cover at least half of the edges.
Take the set $S' = \{(v, i) \in S : v \in \bar{X}\}$. From $S'$, we build a random labelling by selecting a random node (v, i) in $S' \cap W_v$, for each vertex $v \in \bar{X}$. We then set ℓ(v) = i. Because $|W_v \cap S| \le 6\alpha$ for all $v \in \bar{X}$, any edge induced by $\bar{X}$ is covered by this labelling with probability at least $\frac{1}{36\alpha^2}$. Thus, in expectation, this labelling covers at least $\frac{1}{36\alpha^2} \cdot \frac{1}{2}dq = \frac{1}{72\alpha^2} \cdot dq$ edges, as desired.
By derandomizing this reduction, we obtain the following hardness result.
Theorem 3. Densest Simultaneous Subgraph is not approximable within $2^{\log^{1-\varepsilon} n}$ for any ε > 0, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.
Proof. We need to alter the proof of Theorem 13 so that random choices are not used to recover the solution. To do so, instead of sampling from the approximate solution S, we compute, for each vertex u, the expected number of covered edges resulting from picking each candidate vertex (u, i), choose the vertex maximizing this expectation, and repeat this process for each u (conditioning on the choices already made in our computation).
Formally, let $v_1, v_2, \ldots, v_{2q}$ be the vertices of G (ordered arbitrarily). Recall that our proof of Theorem 13 selects a label $L_i$ for $v_i$ uniformly at random amongst all labels ℓ with $(v_i, \ell) \in S \cap W_{v_i}$ (it does this for all i from 1 to 2q). This defines 2q independent random variables $L_1, \ldots, L_{2q}$. We now show how to deterministically assign values to $L_1, \ldots, L_{2q}$ so that the number of edges covered by this assignment is at least the expected number of edges covered by assigning values randomly. Let $\mathrm{Covered}(\ell_1, \ldots, \ell_{2q})$ denote the number of covered edges given labels $\ell_i$ to $v_i$.
For each i from 1 to 2q, proceed as follows. For each label ℓ with $(v_i, \ell) \in S \cap W_{v_i}$, compute
$e(i, \ell) = E[\mathrm{Covered}(\ell_1, \ldots, \ell_{i-1}, \ell, L_{i+1}, \ldots, L_{2q})]$
and pick $\ell_i$ so that $e(i, \ell_i) = \max_{\ell} e(i, \ell)$. It is easy to see that this algorithm produces a solution of value at least $E[\mathrm{Covered}(L_1, \ldots, L_{2q})]$ since for each i, by our choice of $\ell_i$,
$e(i, \ell_i) \ge E[\mathrm{Covered}(\ell_1, \ldots, \ell_{i-1}, L_i, \ldots, L_{2q})]$.
Furthermore, these results extend to the robust variations discussed in the Introduction. This follows via standard techniques, so we defer the corresponding proof (along with a formal definition of robustness) to the Appendix.
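The derandomization step is an instance of the method of conditional expectations, and a small Python sketch may make it concrete. Everything named here is an assumption for illustration: `cand[x]` stands for the (non-empty) list of labels ℓ with (x, ℓ) in S ∩ W_x, `proj` encodes the projections Π as dictionaries, and `expected_covered` and `derandomized_labelling` are hypothetical helper names, not functions from the paper.

```python
from fractions import Fraction

def expected_covered(edges, proj, cand, fixed):
    """Expected number of covered edges when vertices in `fixed` keep their
    label and every other vertex x draws uniformly from cand[x]."""
    total = Fraction(0)
    for (u, v) in edges:
        cu = [fixed[u]] if u in fixed else cand[u]
        cv = [fixed[v]] if v in fixed else cand[v]
        # Edge (u, v) is covered by labels (i, j) exactly when Pi_(u,v)(i) = j.
        hits = sum(1 for i in cu for j in cv if proj[(u, v)].get(i) == j)
        total += Fraction(hits, len(cu) * len(cv))
    return total

def derandomized_labelling(order, edges, proj, cand):
    """Fix labels one vertex at a time, always keeping the label that
    maximizes the conditional expectation given earlier choices."""
    fixed = {}
    for x in order:
        fixed[x] = max(
            cand[x],
            key=lambda lab: expected_covered(edges, proj, cand, {**fixed, x: lab}),
        )
    return fixed
```

Because the best label at each step is at least as good as the average over the candidates, the conditional expectation never decreases, so the final deterministic labelling covers at least as many edges as the random labelling does in expectation. Exact rational arithmetic (`Fraction`) is used here only to avoid floating-point ties; it is a design choice of this sketch.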

Non-zero density
To conclude our discussion on the density measure, we remark that clearly the most basic structure we can possibly search for is a single edge. But an induced subgraph that contains a single edge has non-zero density and vice versa. This leads to the following cluster problem.

Non-Zero Density Problem
Input:
It can then be seen that our hardness proof also applies to Non-Zero Density. Thus this very basic problem is extremely hard to approximate!
Corollary 2. Non-Zero Density is not approximable within $2^{\log^{1-\varepsilon} n}$ for any ε > 0, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.
Of course, if it is very hard to search for a single edge then it is not surprising that quantitative guarantees for practical simultaneous clustering problems are rare.

The Connectivity Measure
Now let's consider the simultaneous cluster problem using our second quality measure, namely graph connectivity. Our vertex set is partitioned into two: a subset T ⊆ V of terminals and a set V \ T of Steiner vertices. A cluster S ⊆ V is then considered good if every pair of terminals in S is simultaneously connected (or k-connected) with respect to each graph. As described in Section 2, notions of terminal connectivity have been applied to expand our current knowledge of genes involved in certain biological processes by treating the known genes in these processes as terminal nodes. Some applications have also treated all nodes (genes) as terminals to detect clusters of functionally coherent genes from biological networks where connectivity implies functional similarity.
Once the desired connectivity k is specified, there are two natural optimization criteria. The first is a maximization criterion: we may desire a good cluster that contains as many terminals as possible. Since connectivity is a monotonic property with regard to the Steiner nodes, it can never hurt to add additional Steiner vertices to such a cluster. Consequently, this maximization criterion is likely to produce very large clusters. Therefore, the second natural criterion is to minimize the cardinality of a good cluster.
In this section we present both good news and bad. The simultaneous cluster problem is solvable in polynomial time with respect to the maximization measure, but is very hard to approximate with respect to the minimization measure.

Terminal Maximization
Consider then our maximization problem.
For a fixed connectivity requirement k, this problem is polynomial time solvable. We remark that this is the case for both vertex-connectivity and edge-connectivity requirements. We show how to solve the vertex-connectivity version using the following recursive approach (the approach for edge-connectivity is similar). We are given a collection G of graphs with vertex set V and terminal set T. If every pair of terminals is k-connected in every graph $G_i \in G$ then we simply output the cluster S = V.
If not, by Menger's Theorem, we can find terminals $t_1$ and $t_2$ that are separated by a vertex-cut W (with cardinality less than k) in some graph $G_j$. So, assume W splits the terminals into a set $T_1$ containing $t_1$ and a set $T_2$ containing $t_2$, where the terminals lying in W belong to both sets. Observe that $T_1$ and $T_2$ need not be disjoint but $|T_1 \cap T_2|$ must be less than k.
We now recurse on the subproblems $G_1$ and $G_2$. Here $G_1$ contains the graphs $G_i[T_1 \cup (V \setminus T)]$, for all 1 ≤ i ≤ t, and has terminal set $T_1$. Similarly, $G_2$ contains the graphs $G_i[T_2 \cup (V \setminus T)]$ and has terminal set $T_2$. Note that each subproblem contains all the Steiner nodes.
Finally, when the algorithm terminates on every subproblem we simply output the best cluster obtained amongst all the subproblems. Let's see that this algorithm runs in polynomial time.
Theorem 4. For a fixed connectivity requirement k, there is a polynomial time algorithm for Maximum Simultaneous k-Connected Steiner Cluster.
Proof. First we need to show that the algorithm gives an optimal solution. The terminals in the cluster output by each subproblem are k-connected, otherwise the algorithm would have found a new vertex-cut to recurse on. So the clusters are feasible solutions. Suppose the optimal solution set of terminals $T^*$ is not output. Then consider the first time at which two terminals $t_1, t_2 \in T^*$ are separated by the algorithm. At this point, let the subproblem consist of the terminals $T'$ and all the Steiner nodes, and let W be the vertex-cut separating $t_1$ and $t_2$. But $T^* \subseteq T'$. So W must also separate $t_1$ and $t_2$ in the graph induced by $T^*$ and all the Steiner nodes. This is a contradiction as, by definition, the cluster consisting of $T^*$ and all the Steiner nodes k-connects all the terminals in $T^*$.
Second we need to show how to implement the algorithm in polynomial time. We do this in two stages. In Stage I, we only run the method until each subproblem contains at most k + 1 terminals. In Stage II, we solve each of these subproblems by brute force, that is, for every subset of the terminals in the subproblem, we check if those terminals are k-connected using all the Steiner nodes, in every graph.
To analyse the running time for Stage I, we show that at most |T| − k subproblems can be examined in this stage. To search for the vertex-cut, we only need to run k|T| max-flow algorithms to check all the terminal pairs. Each flow algorithm takes time O(km) as we can stop if the flow between a pair exceeds k. We must do this on each of the t graphs, so this search takes time $O(|T| \cdot k^2 mt)$. Thus the total run time of the algorithm is polynomial for any fixed k.
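To illustrate the recursive scheme, here is a minimal Python sketch restricted to the simplest case k = 1, where a separating vertex-cut is empty and two terminals are separated exactly when they lie in different connected components of some graph. This is an assumption-laden simplification rather than the general algorithm: it replaces the max-flow search by a component computation, splits into all components at once instead of into two sides, and the function names `components` and `max_connected_terminals` are hypothetical.

```python
def components(vertices, edges):
    """Connected components of the subgraph induced on `vertices`."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        if a in adj and b in adj:  # keep only edges inside the induced graph
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for v in adj:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        comps.append(comp)
    return comps

def max_connected_terminals(terminals, steiner, graphs):
    """Largest terminal set that is connected in every graph induced on
    (that terminal set) + (all Steiner nodes); the case k = 1 only."""
    terminals = set(terminals)
    if len(terminals) <= 1:
        return terminals
    nodes = terminals | set(steiner)
    for edges in graphs:
        parts = [c & terminals for c in components(nodes, edges) if c & terminals]
        if len(parts) > 1:
            # Some pair of terminals is separated in this graph: recurse on
            # each side, keeping all Steiner nodes, and return the best result.
            return max((max_connected_terminals(p, steiner, graphs) for p in parts),
                       key=len)
    return terminals  # every pair of terminals is connected in every graph
```

The returned set contains only terminals; the corresponding cluster is that set together with all Steiner nodes, as in the text.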

Cluster Minimization
On the other hand, if we wish to minimize the number of vertices, the problem becomes hard again, even for the simplest connectivity requirement k = 1. Interestingly, it remains very difficult even in the two extreme cases where (i) there are only two terminals, and (ii) every vertex is a terminal. Let's begin with the case of exactly two terminals, say T = {s, t}. Then our minimization problem is:
Theorem 14. If Simultaneous s-t Path is α-approximable then LabelCover-Max is $\frac{1}{72\alpha^2}$-approximable.
Proof. Take an instance (G, N, Π) of LabelCover-Max where G has bipartition $(A_G, B_G)$. We build an instance of Simultaneous s-t Path by first building an instance H of Densest Simultaneous Subgraph as in the proof of Theorem 13 but without the vertices û and v̂. Note that each graph $H_e \in H$ is bipartite with bipartition $(A_H, B_H)$, where $A_H$ consists of the vertices (u, i) with $u \in A_G$ and $B_H$ of the vertices (v, j) with $v \in B_G$. We build a graph $F_e$ from each graph $H_e$ by
• adding a vertex s with edges between s and every vertex of $A_H$, and
• adding a vertex t with edges between t and every vertex of $B_H$.
Let F be the collection of (a) graphs $F_e$ built from each graph $H_e \in H$, and (b) further graphs $F_s$, $F_t$, and $F_v$ (for each $v \in A_G \cup B_G$) built over the same set of nodes. $F_s$ has edges between s and every other vertex, and $F_t$ has edges between t and every other vertex. For each $v \in A_G \cup B_G$, $F_v$ has edges between every vertex (v, i), for all labels i ∈ [N], and every other vertex. No other edges are present in these graphs. Now any solution must contain s, t, a vertex (u, i) for each vertex $u \in A_G$ and a vertex (v, j) for each vertex $v \in B_G$. Otherwise, the subgraph is not connected in $F_s$, $F_t$, $F_u$ for some $u \in A_G$ or $F_v$ for some $v \in B_G$.
Again, any solution S to our instance F of Minimum Simultaneous Connected Steiner Cluster with {û, v̂} added and {s, t} removed is a solution to the instance H of Densest Simultaneous Subgraph of the same value.
Thus we obtain the hardness results of Theorem 5 and Theorem 6.
Theorem 6. Minimum Simultaneous Connected Steiner Cluster is not approximable within $2^{\log^{1-\varepsilon} n}$ for any ε > 0, unless $NP \subseteq DTIME(n^{\mathrm{polylog}\, n})$.
Similar hardness results also extend to the problem of finding an approximate solution that is required to satisfy the connectivity constraint in only a c fraction of the input graphs. Again, these robustness results are deferred to the Appendix.

Lower Bounds for a Fixed Number of Graphs
Polynomial lower bounds can be obtained in terms of the number g of input graphs. Clearly for Simultaneous s-t Path, we can also obtain a g-approximation given g input graphs by simply taking the union of an optimal solution (a shortest s-t path) for each individual graph.
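The g-approximation mentioned above can be sketched as follows. This is a minimal illustration under assumptions made here: each graph is given as a list of undirected edges, and the helper names `shortest_path` and `union_of_shortest_paths` are hypothetical. Any feasible cluster must contain an s-t path in every graph, so it has at least as many vertices as the longest of the g shortest paths, while the union has at most g times that many.

```python
from collections import deque

def shortest_path(edges, s, t):
    """Vertices of a shortest s-t path (by BFS), or None if t is unreachable."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    parent = {s: None}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        if x == t:
            path = []
            while x is not None:  # walk parents back to s
                path.append(x)
                x = parent[x]
            return path[::-1]
        for y in adj.get(x, ()):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    return None

def union_of_shortest_paths(graphs, s, t):
    """Union of one shortest s-t path per graph; None if some graph has no
    s-t path (in which case no feasible cluster exists at all)."""
    cluster = set()
    for edges in graphs:
        path = shortest_path(edges, s, t)
        if path is None:
            return None
        cluster |= set(path)
    return cluster
```

The returned cluster is feasible because the subgraph it induces in each graph still contains the shortest path chosen for that graph.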
In this section, we show that for any ε > 0 both Simultaneous s-t Path and Minimum Simultaneous Connected Steiner Cluster are $g^{1/2-\varepsilon}$-inapproximable unless NP = ZPP. We use a similar approach to other $k^{\varepsilon}$ complexity results for problems with a fixed parameter k [13, 15]. Again, by the PCP theorem and Raz's parallel repetition theorem we have:
Theorem 16. [35, 6] There exists a constant γ > 0 (independent of ℓ) such that the LabelCover-Max problem obtained from instances of Max-3Sat(5) with ℓ repetitions cannot be approximated within a factor of $2^{\gamma \ell}$. (For constant ℓ, this holds if P ≠ NP. For ℓ = polylog(n), this holds under the assumption $NP \not\subseteq DTIME(n^{\mathrm{polylog}(n)})$.)
Here, Max-3Sat(5) simply refers to Max-Sat instances where there are 3 variables in each clause and every variable appears in 5 clauses. Since instances of LabelCover-Max obtained from Max-3Sat(5) with ℓ repetitions are $(3^{\ell}, 5^{\ell})$-regular, we obtain the following corollary.
Corollary 3. There exists a constant γ > 0 (independent of d) such that the d-regular LabelCover-Max problem cannot be approximated within a factor of $d^{\gamma}$. (For constant d, this holds if P ≠ NP. For $d = n^{\alpha}$, this holds under the assumption $NP \not\subseteq DTIME(n^{\mathrm{polylog}(n)})$.)
Thus, it suffices to build an instance of our problem of interest with $g = d^{\beta}$ graphs (for some constant β > 0) from a d-regular instance of LabelCover-Max to obtain $g^{\varepsilon}$-inapproximability for our problem, where $\varepsilon = \gamma/\beta$ (since $d^{\gamma} = g^{\gamma/\beta}$).
To improve this to a $g^{1/2-\varepsilon}$-inapproximability result, we use Goldreich and Sudan's [20] random sampling technique that reduces the degree of the instance of LabelCover-Max needed. This allows us to improve the bound from Corollary 3 to $g^{1/2-\varepsilon}$ under the assumption that NP ≠ ZPP.
Theorem 17. [14, 28] For any ε > 0, it is hard to approximate instances of LabelCover-Max where vertices have degrees between d/4 and d within a factor of $d^{1/2-\varepsilon}$, unless NP = ZPP.
Thus, it suffices to build an instance of our problem of interest with $g = d^{\beta}$ graphs (for a fixed β > 0) from such an instance of LabelCover-Max to obtain $d^{1/2-\varepsilon} = g^{1/(2\beta)-\varepsilon}$-inapproximability for our problem.
We are now ready to prove Theorems 7 and 8. In our case, the number of graphs is linear in the degree of the input graph to LabelCover-Max (i.e., β = 1).
Theorem 7. Simultaneous s-t Path is not $g^{1/2-\varepsilon}$-approximable for any ε > 0, where g is the number of graphs, unless NP = ZPP.
Proof. We reduce from LabelCover-Max and construct an instance of Simultaneous s-t Path whose number of graphs is linear in the degree of the graph in LabelCover-Max. Take an instance (G, N, Π) of LabelCover-Max where G has bipartition (A, B). We build an instance of Simultaneous s-t Path by first building an instance H of Densest Simultaneous Subgraph as in the proof of Theorem 13 but without the vertices û and v̂ (note that although G is not regular, this construction is still well defined). We then build a new instance F of Simultaneous s-t Path using d graphs built from H.
Note that each graph $H_e \in H$ is bipartite with bipartition $(A_H, B_H)$. Since G is bipartite and of maximum degree d, we can partition its edges into d matchings $M_1, \ldots, M_d$. Let $u_{i,1}v_{i,1}, \ldots, u_{i,q}v_{i,q}$ be the edges of $M_i$. We construct $F_i$ by taking the union of all (edges in) $H_e$, $e \in M_i$, and adding a source s and a sink t and the following edge sets $C_i$, $S_i$ and $T_i$.
$C_i = \{(v_{i,j}, \ell_1)(u_{i,j+1}, \ell_2) : 1 \le j < q,\ \ell_1, \ell_2 \in [N]\}$
$S_i = \{s(u_{i,1}, \ell) : \ell \in [N]\}$
$T_i = \{(v_{i,q}, \ell)t : \ell \in [N]\}$
Note that every st-path in $F_i$ uses at least one edge from each $H_e$, $e \in M_i$ (since each $E(H_e)$ is an st-cut in $F_i$). Furthermore, we can obtain an st-path by choosing (any) one edge from each $H_e$, $e \in M_i$, and the appropriate edges in $C_i$, $S_i$ and $T_i$.
Therefore, there is an st-path in all $F_i[S]$ if and only if S induces an edge in each $H_e \in H$. The result now follows from Theorem 17.
Theorem 8. Minimum Simultaneous Connected Steiner Cluster is not $g^{1/2-\varepsilon}$-approximable for any ε > 0, where g is the number of graphs, unless NP = ZPP.
Proof. Take the instance of Simultaneous s-t Path from Theorem 7 above and add a graph $F_s$ which is a star centered at s and a graph $F_t$ which is a star centered at t. Now, every solution S to our instance of Minimum Simultaneous Connected Steiner Cluster contains s and t (or one of $F_s[S]$ or $F_t[S]$ is disconnected). Therefore, every solution to our instance of Minimum Simultaneous Connected Steiner Cluster is also a solution to the original instance of Simultaneous s-t Path.
To complete the proof, we show that any feasible solution S to Simultaneous s-t Path corresponds to a feasible solution $S^* = S \cup \{s, t\}$ to Minimum Simultaneous Connected Steiner Cluster. $F_s[S^*]$ is connected since $s \in S^*$ and every other vertex has an edge to s. $F_t[S^*]$ is connected since $t \in S^*$ and every other vertex has an edge to t.
Since S is a solution to Simultaneous s-t Path, for any i, there is an st-path $P = \{s, (u_{i,1}, \ell_1), (v_{i,1}, \ell_1), (u_{i,2}, \ell_2), (v_{i,2}, \ell_2), \ldots, (u_{i,q}, \ell_q), (v_{i,q}, \ell_q), t\}$ in $F_i[S]$ and, since $M_i$ is a perfect matching, this path P contains a vertex (x, j) for each x ∈ A ∪ B. We now show that every other vertex of $F_i$ has an edge to P, thus proving $F_i[S^*]$ is connected.
Indeed, for any vertex $(u_{i,j}, k) \in A_H$ with $u_{i,j} \in A$ and $u_{i,j}v_{i,j} \in M_i$, either
• j = 1 and $s(u_{i,j}, k) \in S_i$ so $(u_{i,j}, k)$ is adjacent to P, or
• j > 1 and $(v_{i,j-1}, \ell_{j-1})(u_{i,j}, k) \in C_i$ so again $(u_{i,j}, k)$ is adjacent to P.
We use a symmetric proof for any vertex $(v_{i,j}, k) \in B_H$ with $v_{i,j} \in B$ and $u_{i,j}v_{i,j} \in M_i$. Either
• j = q and $(v_{i,j}, k)t \in T_i$ so $(v_{i,j}, k)$ is adjacent to P, or
• j < q and $(v_{i,j}, k)(u_{i,j+1}, \ell_{j+1}) \in C_i$ so again $(v_{i,j}, k)$ is adjacent to P.
Thus, all the $F_i[S^*]$ are connected and the theorem follows by Theorem 17.

Conclusion and Directions
We have presented algorithmic and complexity results for the problem of finding clusters supported by multiple graphs, where each graph represents a distinct set of similarity relationships (edges) over the same set of objects (nodes). While we obtain tractable algorithms for certain measures of cluster quality, we show that the problem is typically hard to approximate even when we relax many of the requirements, such as relaxing the problem from many graphs to just two graphs for the density measure, from connectivity among many terminals to just two terminals, or from quality constraints met in every input graph to constraints met in only a fraction of them.
The implications of our results are two-fold.First, our results explain why guarantees on the clustering quality or running time have been elusive in the vast amount of previous empirical and heuristic works on simultaneous clustering of datasets arising in scientific and commercial domains.Second, our work suggests alternate problem abstractions may also be suitable for quantitative study.
For example, we could consider a new model where the input graphs have correlated edge weights, since the hardness of most problems we consider stems from allowing the graphs to have arbitrary edge weights. Assuming the similarity functions of different input graphs to be correlated for all edges is not realistic though, especially in the biological sciences where the datasets are very noisy, incomplete and heterogeneous (due to factors like the different types of cellular responses each input network captures, the highly incomplete nature of networks assembled from small-scale biological studies, and the bias, batch effects or technology-dependent artifacts affecting networks inferred from large-scale biological studies [24]). However, we could reasonably assume that those edges present in the optimal solution or subgraph have correlated edge weights across the input graphs. Introducing this assumption may make the problem tractable by allowing us to exclude edges that are not correlated in the graphs before searching for the optimal solution.
Comparative analysis of clustering structures between multiple networks is another pressing problem in data integration.Given a separation of the input networks into two classes A and B (say diseased vs. healthy), can we find subgraphs that cluster well in most of the class A networks and poorly in most of the class B networks?
produces a subgraph of density at least d with at most k vertices in G. We let $k = \beta n_2$, $|H_1| = \tau_1 k$, and $|H_2| = \tau_2 n_2$. As $n_2$ is the cardinality of the smallest graph with dk edges, it must be the case that $k \ge n_2$ (since the desired subgraph $H^*$ in G has k vertices and dk edges). So $k = \beta n_2$ for some β ≥ 1.
search takes time $O(|T| \cdot k^2 mt)$. If there are at most |T| − k subproblems in Stage I then the total run time for the stage is $O(|T|^2 \cdot k^2 mt)$. We show by induction that the number of subproblems in Stage I is indeed at most |T| − k. For the base case, if |T| = k + 1 then we stop immediately. Consequently, there is only one subproblem to consider. So consider the case where |T| > k + 1. Suppose T is split into $T_1$ and $T_2$ by the vertex-cut. By induction, the number of subproblems considered for $T_i$ is at most $|T_i| - k$, for i ∈ {1, 2}. Moreover, we know that $|T_1 \cap T_2| \le k - 1$. Thus the total number of subproblems considered for T is at most $1 + |T_1| - k + |T_2| - k \le 1 + (|T| + k - 1) - k - k = |T| - k$, as desired.
Now consider the running time for Stage II. When |T| ≤ k + 1 we can simply use brute force. For every subset of the terminals, check whether those terminals are k-connected using all the Steiner nodes, in every graph. By the method above this takes time $O(|T| \cdot k^2 mt) = O(k^3 mt)$. There are $2^{k+1}$ subsets and |T| − k subproblems, so the run time of Stage II is at most $O(2^k \cdot |T| \cdot k^3 mt)$.
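The inductive step of the subproblem count can be written out in full. The notation $N(T)$ for the number of Stage I subproblems generated from terminal set $T$ is introduced here only for illustration, and the first identity assumes that $T_1 \cup T_2 = T$, i.e. that every terminal lands in at least one side of the split.

```latex
\begin{align*}
|T_1| + |T_2| &= |T_1 \cup T_2| + |T_1 \cap T_2| \;\le\; |T| + (k - 1), \\
N(T) &\le 1 + N(T_1) + N(T_2) \\
     &\le 1 + (|T_1| - k) + (|T_2| - k) \\
     &\le 1 + (|T| + k - 1) - 2k \;=\; |T| - k .
\end{align*}
```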
and special vertices s and t.
Objective: A minimum cardinality cluster S ⊆ V inducing an s-t path in each $G_i[S]$.

Theorem 1. If we can solve Densest Simultaneous Subgraph on two graphs in polynomial time then we can solve Densest k-Subgraph in polynomial time. If we can approximate Densest Simultaneous Subgraph on two graphs within a factor of α then we can approximate Densest k-Subgraph within a factor of $4\alpha^2$.
We now build a graph $F_e$ from each graph $H_e$ by
• adding a source s with edges from s to every vertex of $A_H$, and
• adding a sink t with edges from every vertex of $B_H$ to t.
Let F be the collection of graphs $F_e$ built from each graph $H_e \in H$. Now, a solution S to our instance F of Simultaneous s-t Path with {û, v̂} added and {s, t} removed is a solution to the instance H of Densest Simultaneous Subgraph of the same value.
Now consider the other extreme, where all vertices are terminals. So in any solution cluster, every pair of vertices in that cluster must be connected in each induced graph $G_i[S]$.
Minimum Simultaneous Connected Steiner Cluster
Input: Graphs $G_i = (V, E_i)$ for 1 ≤ i ≤ t.
Objective: A minimum cardinality cluster S ⊆ V (of size at least 2) such that every pair of vertices in S is connected in each induced graph $G_i[S]$.
Theorem 15. If Minimum Simultaneous Connected Steiner Cluster is α-approximable then LabelCover-Max is $\frac{1}{72\alpha^2}$-approximable.
Proof. Take an instance (G, N, Π) of LabelCover-Max where G has bipartition (A, B). We build an instance of Minimum Simultaneous Connected Steiner Cluster by first building an instance H of Densest Simultaneous Subgraph as in the proof of Theorem 13 but without the vertices û and v̂. Note that each graph $H_e \in H$ is bipartite with bipartitions