Connectivity problems on heterogeneous graphs

Background Network connectivity problems are abundant in computational biology research, where graphs are used to represent a range of phenomena: from physical interactions between molecules to more abstract relationships such as gene co-expression. One common challenge in studying biological networks is the need to extract meaningful, small subgraphs out of large databases of potential interactions. A useful abstraction for this task turned out to be the Steiner Network problems: given a reference “database” graph, find a parsimonious subgraph that satisfies a given set of connectivity demands. While this formulation proved useful in a number of instances, the next challenge is to account for the fact that the reference graph may not be static. This can happen for instance, when studying protein measurements in single cells or at different time points, whereby different subsets of conditions can have different protein milieu. Results and discussion We introduce the condition Steiner Network problem in which we concomitantly consider a set of distinct biological conditions. Each condition is associated with a set of connectivity demands, as well as a set of edges that are assumed to be present in that condition. The goal of this problem is to find a minimal subgraph that satisfies all the demands through paths that are present in the respective condition. We show that introducing multiple conditions as an additional factor makes this problem much harder to approximate. 
Specifically, we prove that for C conditions, this new problem is NP-hard to approximate to a factor of C − ϵ, for every C ≥ 2 and ϵ > 0, and that this bound is tight. Moving beyond the worst case, we explore a special set of instances where the reference graph grows monotonically between conditions, and show that this problem admits substantially improved approximation algorithms. We also develop an integer linear programming solver for the general problem and demonstrate its ability to reach optimality with instances from the human protein interaction network. Conclusion Our results demonstrate that in contrast to most connectivity problems studied in computational biology, accounting for multiplicity of biological conditions adds considerable complexity, which we propose to address with a new solver. Importantly, our results extend to several network connectivity problems that are commonly used in computational biology, such as Prize-Collecting Steiner Tree, and provide insight into the theoretical guarantees for their applications in a multiple condition setting.


Background
In molecular biology applications, networks are routinely defined over a wide range of basic entities such as proteins, genes, metabolites, or drugs, which serve as nodes. The edges in these networks can have different meanings, depending on the particular context. For instance, in protein-protein interaction (PPI) networks, edges represent physical contact between proteins, either within stable multi-subunit complexes or through transient causal interactions (i.e., an edge (x, y) means that protein x can cause a change to the molecular structure of protein y and thereby alter its activity). The body of knowledge encapsulated within the human PPI network (tens of thousands of nodes and hundreds of thousands of edges in current databases, curated from thousands of studies [1]) is routinely used by computational biologists to generate hypotheses of how various signals are transduced in eukaryotic cells [2][3][4][5][6]. The basic premise is that a process that starts with a change to the activity of protein u and ends with the activity of protein v must be propagated through a chain of interactions between u and v. The natural extension regards a process with a certain collection of protein pairs {(u 1 , v 1 ), . . . , (u k , v k )} , where we are looking for a chain of interactions between each u i and v i [7]. In another set of applications, the notion of directionality is not directly assumed and instead, one is looking for a parsimonious subgraph that connects together a set S of proteins that are postulated to be active [8,9].
In most applications, the identity of the so called terminal nodes (i.e., (u i , v i ) pairs or the set S) is assumed to be known (or inferred from experimental data such as ChIPseq [5,8,9]), while the identity of the intermediate nodes and interactions is unknown. The goal therefore becomes to complete the gap and find a probable subgraph of the PPI network that simultaneously satisfies all the connectivity demands, thereby explaining the overall biological activity. Since the edges in the PPI network can be assigned a probability value (reflecting the credibility of their experimental evidence), by taking the negative log of these values as edge weights, the task becomes minimizing the total edge weight, leading to an instance of the Steiner Network problem. We have previously used this approach to study the propagation of a stabilizing signal in pro-inflammatory T cells, leading to the identification of a new molecular pathway (represented by a sub-graph of the PPI network) that is critical for mounting an autoimmune response, as validated experimentally by perturbation assays and disease models in mice [5]. Tuncbag et al. [9] have utilized the undirected approach using the Prize-Collecting Steiner Tree model, where the input is a network G along with a penalty function, p(v) for each protein (node) in the network (based on their importance; e.g., fold-change across conditions). The goal in this case is to find a probable subtree which contains the majority of the high cost proteins in G, while accounting for penalties paid by both edge usage and missing proteins, in order to capture the biological activity represented in such a network [8,9].
While these studies contributed to our understanding of signal transduction pathways in living cells, they do not account for a critical aspect of the underlying biological complexity. In reality, proteins (nodes) can become activated or inactivated under different conditions, thereby giving rise to a different set of potential PPIs that might take place [10]. Here, the term condition can refer to different points in time [11], different treatments [12], or, more recently, different cells [13]. Indeed, advances in experimental proteomics provide a way to estimate these changes at high throughput, e.g., measuring phosphorylation levels or overall protein abundance, proteome-wide for a limited number of samples [12]. A complementary line of work provides a way to evaluate the abundance of smaller numbers of proteins (typically dozens of them) in hundreds of thousands of single cells [13].
The next challenge is therefore to study connectivity problems that take into account not only the endpoints of each demand, but also the condition in which these demands should be satisfied. This added complication was tackled by Mazza et al. [14], who introduced the "Minimum k-Labeling (MKL)" problem. In this setting, each connectivity demand comes with a label, which represents a certain experimental condition or time point. The task is to label edges in the PPI network so as to satisfy each demand using its respective label, while minimizing the number of edges in the resulting sub-graph and the number of labels used to annotate these edges. While MKL was an important first step, namely introducing the notion of different demands for each condition, the more difficult challenge still remains that of considering variability in the reference graph, namely different sets of proteins that may be active and available for use in each condition. To this effect, we note the existence of multi-layer networks in the data-mining space. In this context, studies have focused on networks which have edges that span across specified dimensions, or conditions [15,16]. However, we could not find studies that tackle the problem of parsimonious connectivity in this domain.

Summary of main contributions
To address this open challenge, here we introduce the Condition Steiner Network (CSN) problem. In this setting, we are given a weighted undirected graph G, a set of C conditions and a set of k ≥ C demands, at least one per condition (note that we also cover the case of directed graphs, with similar results). The conditions are specified over a sequence of graphs $G_c$ defined over each condition, where vertices remain the same, but edges are allowed to change across conditions (notably, our results also hold when $G_c$ is defined with changing vertices rather than edges). Furthermore, demands are in the form of "connect node u to node v through a path of nodes that are present in condition c". The goal is to find a minimum-weight subgraph of G that satisfies all the demands (Fig. 1). We first show that it is NP-hard to find a solution that achieves a nontrivial approximation factor (by the "trivial" approximation, we mean the one obtained by solving the problem independently for each condition). This result extends to several types of connectivity problems and provides theoretical lower bounds on the best-possible approximation guarantee that can be achieved in a multiple condition setting (Table 1). For instance, we can conclude that concomitantly solving the shortest path problem for a set of conditions is hard to approximate, and that the trivial solution (i.e., solving the problem to optimality in each condition) is, theoretically, the best that one can do. Another example, commonly used in PPI analysis, is the Prize-Collecting Steiner Tree problem [8,9]. Here, our results indicate that given a fixed input for this problem (i.e., a penalty function p(v) for each vertex), it is NP-hard to solve it concomitantly in C conditions, such that the weight of the obtained solution is less than C times that of the optimal solution.
Interestingly, a theoretical guarantee of $C \cdot (2 - \frac{2}{|V|})$ can be obtained by solving the problem independently for each time point. While these results provide a somewhat pessimistic view, they rely on the assumption that the network frames $G_c$ are arbitrary. In the last part of this paper, we show that for the specific case where the conditions can be ordered such that each condition is a subset of the

Table 1 Approximation bounds for the various Steiner Network problems in their classic setting and condition setting
Columns: Problems | Classic | Condition
For the classic problems, we have indicated the papers in which the bounds are shown. For the condition problems, all the lower bounds are developed in the present work; all the upper bounds are the naive bounds obtained from the "union of shortest paths" heuristic, or from applying the best known approximation algorithm for the appropriate classic Steiner problem to each condition, then taking the union of those solutions.

Introduction to Steiner problems
The Steiner Tree problem, along with its many variants and generalizations, forms a core family of NP-hard combinatorial optimization problems. Traditionally, the input to one of these problems is a single (usually weighted) graph, along with requirements about which nodes need to be connected in some way; the goal is to pick a minimum-weight subgraph satisfying the connectivity demands.
In this paper, we offer a multi-condition perspective; in our setting, multiple graphs over the same vertex set (which one can think of as an initial graph changing over a set of discrete conditions), are all given as input, and the goal is to pick a subgraph satisfying condition-sensitive connectivity requirements. Our study of this problem draws motivation and techniques from several lines of research, which we briefly summarize.

Classic Steiner problems
A basic problem in graph theory is finding the shortest path between two nodes; this problem is efficiently solved using, for example, Dijkstra's algorithm.
A natural extension of this is the Steiner Tree problem: given a weighted undirected graph G = (V, E) and a set of terminals T ⊆ V, find a minimum-weight subtree that connects all the nodes in T. A further generalization is Steiner Forest: given G = (V, E) and a set of demand pairs D ⊆ V × V, find a subgraph that connects each pair in D. Currently the best known approximation algorithms give a ratio of 1.39 for Steiner Tree [17] and 2 for Steiner Forest [18]. These problems are known to be NP-hard to approximate to within some small constant [19].
For directed graphs, we have the Directed Steiner Network (DSN) problem, in which we are given a weighted directed graph G = (V, E) and k demands $(a_1, b_1), \ldots, (a_k, b_k) \in V \times V$, and must find a minimum-weight subgraph in which each $a_i$ has a path to $b_i$. When k is fixed, DSN admits a polynomial-time exact algorithm [20]. For general k, the best known approximation algorithms have ratio $O(k^{1/2+\epsilon})$ for any fixed ϵ > 0 [21,22]. On the complexity side, Dodis and Khanna [23] ruled out a polynomial-time $O(2^{\log^{1-\epsilon} n})$-approximation for this problem unless NP has quasipolynomial-time algorithms. An important special case of DSN is Directed Steiner Tree, in which all demands have the form $(r, b_i)$ for some root node r. This problem has an $O(k^{\epsilon})$-approximation scheme [24] and a lower bound of $\Omega(\log^{2-\epsilon} n)$ [25].
Finally, a Steiner variant that has found extensive use in computational biology is the Prize-Collecting Steiner Tree problem, in which the input contains a weighted undirected graph G = (V, E) and penalty function $p : V \to \mathbb{R}_{\ge 0}$; the goal is to find a subtree which simultaneously minimizes the weights of the edges in the tree and the penalties paid for nodes not included within the tree, i.e. $\mathrm{cost}(T) := \sum_{e \in T} w(e) + \sum_{v \notin T} p(v)$. For this problem, an approximation algorithm with ratio 1.967 is known [26].
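As a small illustration of the Prize-Collecting Steiner Tree objective, the following sketch evaluates the cost of a candidate tree. The function name `pcst_cost` and the dictionary-based encoding of weights and penalties are assumptions made for this example only; a tree is represented by its edge list, so a single-node tree is outside the scope of the sketch.

```python
def pcst_cost(tree_edges, weights, penalties):
    """Sum of edge weights in the tree plus penalties of nodes the tree misses.

    tree_edges: iterable of (u, v) tuples (hypothetical encoding)
    weights:    dict mapping each (u, v) tuple to its weight w(e)
    penalties:  dict mapping each node v to its penalty p(v)
    """
    tree_edges = list(tree_edges)
    # Nodes touched by at least one tree edge count as included.
    covered = {node for edge in tree_edges for node in edge}
    edge_cost = sum(weights[edge] for edge in tree_edges)
    penalty_cost = sum(p for node, p in penalties.items() if node not in covered)
    return edge_cost + penalty_cost


weights = {("a", "b"): 1.0, ("b", "c"): 2.0}
penalties = {"a": 5.0, "b": 5.0, "c": 0.5, "d": 3.0}
small_tree_cost = pcst_cost([("a", "b")], weights, penalties)
```

In this toy input, skipping the cheap-penalty node c is preferable to paying for the edge (b, c), which is the trade-off the objective is designed to capture.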

Condition Steiner problems
In this paper, we generalize the Shortest Path, Steiner Tree, Steiner Forest, Directed Steiner Network, and Prize-Collecting Steiner Tree problems to the multi-condition setting. In this setting, we have a set of conditions [C] := {1, . . . , C}, and are given a graph for each condition.
Our main object of study is the natural generalization of Steiner Forest (in the undirected case) and Directed Steiner Network (in the directed case), which we call Condition Steiner Network:
Definition 1 (Condition Steiner Network (CSN)) We are given the following inputs:
1. A sequence of undirected graphs $G_1 = (V, E_1), \ldots, G_C = (V, E_C)$, one for each condition $c \in [C]$. Each edge e in the underlying edge set $E := \bigcup_c E_c$ has a weight w(e) ≥ 0.
2. A set of k connectivity demands $D \subseteq V \times V \times [C]$. We assume that for every $c \in [C]$ there exists at least one demand, and therefore that k ≥ C.
We call G = (V , E) the underlying graph. We say a subgraph H ⊆ G satisfies demand (a, b, c) ∈ D if H contains an a-b path P along which all edges exist in G c . The goal is to output a minimum-weight subgraph H ⊆ G that satisfies every demand in D.
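The feasibility condition in Definition 1 can be checked directly with a graph search restricted to one condition. The sketch below is illustrative only: the name `satisfies`, the `frozenset` edge encoding, and the dictionary `cond_edges` (mapping each condition c to its edge set) are assumptions of this example, not notation from the paper.

```python
from collections import deque


def satisfies(h_edges, cond_edges, demand):
    """Return True if H has an a-b path whose edges all exist in condition c."""
    a, b, c = demand
    # Only edges of H that are present at condition c may be used.
    adjacency = {}
    for edge in h_edges:
        if edge in cond_edges[c]:
            u, v = tuple(edge)
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    # Breadth-first search from a over the usable edges.
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def edge(u, v):
    return frozenset((u, v))


cond_edges = {1: {edge("a", "x"), edge("x", "b")},
              2: {edge("a", "y"), edge("y", "b")}}
H = {edge("a", "x"), edge("x", "b")}
ok_at_1 = satisfies(H, cond_edges, ("a", "b", 1))
ok_at_2 = satisfies(H, cond_edges, ("a", "b", 2))
```

Here the same subgraph H connects a to b at condition 1 but not at condition 2, because none of its edges exist in the second graph.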

Definition 2 (Directed Condition Steiner Network (DCSN)) This is the same as CSN except that all the edges are directed, and a demand (a, b, c) must be satisfied by a directed path from a to b in $G_c$.
Wu et al. Algorithms Mol Biol (2019) 14:5
We can also define the analogous generalizations of Shortest Path, (undirected) Steiner Tree, and Prize-Collecting Steiner Tree. We give hardness results and algorithms for these problems by demonstrating reductions to and from CSN and DCSN.

Definition 3 (Condition Shortest Path (CSP), Directed Condition Shortest Path (DCSP)) These are the special cases of CSN and DCSN in which the demands are precisely (a, b, 1), . . . , (a, b, C) where a, b ∈ V are common source and target nodes.

Definition 4 (Condition Steiner Tree (CST))
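We are given a sequence of undirected graphs $G_1 = (V, E_1), \ldots, G_C = (V, E_C)$, a weight w(e) ≥ 0 on each edge e, and a set of terminals $X_c \subseteq V$ for each condition $c \in [C]$. We say a subgraph $H \subseteq (V, \bigcup_c E_c)$ satisfies the terminal set $X_c$ if the nodes in $X_c$ are mutually reachable using edges in H that exist at condition c. The goal is to find a minimum-weight subgraph H that satisfies $X_c$ for every $c \in [C]$.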

Definition 5 (Condition Prize-Collecting Steiner Tree (CPCST))
We are given a sequence of undirected graphs . The goal is to find a subtree T that minimizes $\sum_{e \in T} w(e)$
Finally, in molecular biology applications, it is often the case that all the demands originate from a common root node. To capture this, we define the following special case of DCSN:
Definition 6 (Single-Source DCSN) This is the special case of DCSN in which every demand is of the form (r, b, c) for a common root node r ∈ V.
It is also natural to consider variants of these problems in which nodes (rather than edges) vary across the conditions, or in which both nodes and edges vary. In Problem variants, we show that all three variants are in fact equivalent; thus we focus on the edge-based formulations.

Our results
In this work, we perform a systematic study of the condition Steiner problems defined above, from the standpoint of approximation algorithms-that is, algorithms that return subgraphs whose total weights are not much greater than that of the optimal subgraph-as well as integer linear programming (ILP). Since all of the condition Steiner problems listed in the previous section turn out to be NP-hard (and in fact all of them except Shortest Path are hard even in the classic single-condition setting) we cannot hope for algorithms that find optimal solutions and run in polynomial time.
First, in Hardness of condition Steiner problems, we show a series of strong negative results, starting with (directed and undirected) Condition Steiner Network:
Theorem 1 (Main Theorem) CSN and DCSN are NP-hard to approximate to a factor of C − ϵ as well as k − ϵ for every fixed k ≥ 2 and every constant ϵ > 0. For DCSN, this holds even when the underlying graph is acyclic.
Thus the best approximation ratio one can hope for is C or k; the latter upper bound is easily achieved by the trivial "union of shortest paths" algorithm: for each demand (a, b, c), compute the shortest a-b path at condition c; then take the union of these k paths. This contrasts with the classic Steiner Network problems, which have nontrivial approximation algorithms and efficient fixed-parameter algorithms.
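The "union of shortest paths" algorithm is simple enough to sketch in full. The version below is a minimal illustration for the undirected case using Dijkstra's algorithm; the function names, the `frozenset` edge encoding, and the toy instance are all assumptions of this example.

```python
import heapq


def edge(u, v):
    return frozenset((u, v))


def shortest_path_edges(edges, weights, source, target):
    """Dijkstra on an undirected graph; returns the edge set of a shortest path or None."""
    adjacency = {}
    for e in edges:
        u, v = tuple(e)
        adjacency.setdefault(u, []).append((v, e))
        adjacency.setdefault(v, []).append((u, e))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if node == target:
            break
        for neighbor, e in adjacency.get(node, []):
            candidate = d + weights[e]
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor], prev[neighbor] = candidate, (node, e)
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    path, node = set(), target
    while node != source:
        node, e = prev[node]
        path.add(e)
    return path


def union_of_shortest_paths(cond_edges, weights, demands):
    """For each demand (a, b, c), add a shortest a-b path in condition c to the solution."""
    solution = set()
    for a, b, c in demands:
        path = shortest_path_edges(cond_edges[c], weights, a, b)
        if path is None:
            raise ValueError(f"demand {(a, b, c)} cannot be satisfied")
        solution |= path
    return solution


weights = {edge("a", "m"): 1.0, edge("m", "b"): 1.0, edge("a", "b"): 1.5}
cond_edges = {1: {edge("a", "m"), edge("m", "b")},
              2: {edge("a", "m"), edge("m", "b"), edge("a", "b")}}
solution = union_of_shortest_paths(cond_edges, weights, [("a", "b", 1), ("a", "b", 2)])
total = sum(weights[e] for e in solution)
```

In this toy instance the heuristic pays for both the two-edge path (forced at condition 1) and the direct edge (shortest at condition 2), although the two-edge path alone would satisfy both demands. This is the kind of missed overlap that makes the heuristic only a k-approximation.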
Next, we show similar hardness results for the other three condition Steiner problems. This is achieved by a series of simple reductions from CSN and DCSN.

Theorem 2 Condition Shortest Path, Directed Condition Shortest Path, Condition Steiner Tree, and Condition Prize-Collecting Steiner Tree are all NP-hard to approximate to a factor of C − ϵ for every fixed C ≥ 2 and ϵ > 0.
Note that each of these condition Steiner problems can be naively approximated by applying the best known algorithm for the classic version of that problem in each graph in the input, then taking the union of all those subgraphs. If the corresponding classic Steiner problem can be approximated to a factor of α , then this process gives an α · C-approximation for the condition version. Thus using known constant-factor approximation algorithms, each of the condition problems in Theorem 2 has an O(C)-approximation algorithm. Our result shows that in the worst case, one cannot do much better.
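The α · C bound for the connectivity problems follows from a short chain of inequalities. In the sketch below, $H_c$ denotes the α-approximate solution computed for condition c alone, $\mathrm{OPT}_c$ the optimum of that single-condition instance, and $\mathrm{OPT}$ the optimum of the condition problem; these symbols are introduced here for illustration. The key observation is that the part of an optimal condition solution that exists in $G_c$ is feasible for the single-condition instance, so $\mathrm{OPT}_c \le \mathrm{OPT}$.

```latex
\[
w\Bigl(\bigcup_{c=1}^{C} H_c\Bigr)
\;\le\; \sum_{c=1}^{C} w(H_c)
\;\le\; \sum_{c=1}^{C} \alpha \cdot \mathrm{OPT}_c
\;\le\; \alpha \cdot C \cdot \mathrm{OPT}.
\]
```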
While these results provide a somewhat pessimistic view, the proofs rely on the assumption that the edge sets in the input networks (that is, $E_1, \ldots, E_C$) do not necessarily bear any relationship to one another. In Monotonic special cases, we move beyond this worst-case assumption by studying a broad class of special cases in which the conditions are monotonic: if an edge e exists in some graph $G_c$, then it exists in all the subsequent graphs $G_{c'}$, $c' \ge c$. In other words, each graph in the input is a subgraph of the next. For these problems, we prove the following two theorems: It has no $\Omega(\log \log n)$-approximation algorithm unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{\log \log \log n})$.
In the directed case, for monotonic DCSN with a single source (that is, every demand is of the form (r, b, c) for a common root node r), we show the following: It has no $\Omega(\log^{2-\epsilon} n)$-approximation algorithm unless $\mathrm{NP} \subseteq \mathrm{ZPTIME}(n^{\mathrm{polylog}(n)})$.
These bounds are proved via approximation-preserving reductions to and from classic Steiner problems, namely Priority Steiner Tree and Directed Steiner Tree. Conceptually, this demonstrates that imposing the monotonicity requirement makes the condition Steiner problems much closer to their classic counterparts, allowing us to obtain algorithms with substantially better approximation guarantees.
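Whether a given input falls into the monotonic special case is easy to test. The following sketch checks the subset chain for a given ordering and, as an assumption of this example, also searches for an ordering by sorting the edge sets by size (if a chain ordering exists, sizes must be non-decreasing along it). The function names are hypothetical.

```python
def is_monotonic(edge_sets):
    """True if each edge set is a subset of the next one in the given order."""
    return all(earlier <= later for earlier, later in zip(edge_sets, edge_sets[1:]))


def monotonic_order(edge_sets):
    """Return condition indices ordered into a subset chain, or None if impossible."""
    order = sorted(range(len(edge_sets)), key=lambda i: len(edge_sets[i]))
    chain = [edge_sets[i] for i in order]
    return order if is_monotonic(chain) else None


growing = [{"e1"}, {"e1", "e2"}, {"e1", "e2", "e3"}]
shuffled = [{"e1", "e2", "e3"}, {"e1"}, {"e1", "e2"}]
unrelated = [{"e1"}, {"e2"}]
```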
Finally, in Application to protein-protein interaction networks, we show how to model various condition Steiner problems as integer linear programs (ILPs). In experiments on real-world inputs derived from the human PPI network, we find that these ILPs are capable of reaching optimal solutions in a reasonable amount of time. Table 1 summarizes our results, emphasizing how the known upper and lower bounds change when going from the classic Steiner setting to the condition Steiner setting.

Preliminaries
Note that the formulations of CSN and DCSN in the introduction involved a fixed vertex set; only the edges change over the conditions. It is also natural to formulate the Condition Steiner Network problem with nodes changing over conditions, or both nodes and edges. However, by the following proposition, it is no loss of generality to discuss only the edge-condition variant.

Proposition 1
The edge, node, and node-and-edge variants of CSN are mutually polynomial-time reducible via strict reductions (i.e. preserving the approximation ratio exactly). Similarly all three variants of DCSN are mutually strictly reducible.
We defer the precise definitions of the other two variants, as well as the proof of this proposition, to Problem variants.
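In this edge-condition setting, it makes sense to define certain set operations on graphs, which will be of use in our proofs. To that end, let $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ be two graphs on the same vertex set. Their union, $G_1 \cup G_2$, is defined as $(V, E_1 \cup E_2)$; their intersection, $G_1 \cap G_2$, is defined as $(V, E_1 \cap E_2)$.
Next we state the Label Cover problem, which is the starting point of one of our reductions to CSN.
Definition 7 (Label Cover (LC)) An instance of this problem consists of a bipartite graph G = (U, V, E) and a set of possible labels Σ. The input also includes, for each edge (u, v) ∈ E, projection functions $\pi_u^{(u,v)} : \Sigma \to C$ and $\pi_v^{(u,v)} : \Sigma \to C$, where C is a common set of colors; Ψ denotes the set of all such functions. A labeling assigns a label from Σ to every node in U ∪ V, and it satisfies an edge (u, v) if $\pi_u^{(u,v)}$ and $\pi_v^{(u,v)}$ map the labels of u and v to the same color. The task is to find a labeling that satisfies as many edges as possible.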
This problem was first defined in [28]. It has the following gap hardness, as shown by Arora et al. [29] and Raz [30].
Theorem 5 For every ϵ > 0, there is a constant |Σ| such that the following promise problem is NP-hard: Given a Label Cover instance (G, Σ, Ψ), distinguish between the following cases:
• (YES instance) There exists a total labeling of G; i.e. a labeling that satisfies every edge.
• (NO instance) There does not exist a labeling of G that satisfies more than ϵ|E| edges.
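To make the YES/NO distinction concrete, the sketch below computes the fraction of Label Cover edges that a labeling satisfies, under the usual reading that an edge is satisfied when the projections of its two endpoint labels land on the same color. The function name, the dictionary encoding of the projections, and the specific projection values in the toy instance are assumptions chosen for illustration.

```python
def satisfied_fraction(edges, proj_left, proj_right, labeling):
    """Fraction of edges (u, v) whose endpoint labels project to the same color.

    proj_left[(u, v)] and proj_right[(u, v)] are dicts from labels to colors
    (a hypothetical encoding of the projection functions).
    """
    satisfied = sum(
        1
        for (u, v) in edges
        if proj_left[(u, v)][labeling[u]] == proj_right[(u, v)][labeling[v]]
    )
    return satisfied / len(edges)


# Toy instance: a single edge, labels {1, 2}, colors {1, 2}; values are assumed.
edges = [("u", "v")]
proj_left = {("u", "v"): {1: 1, 2: 1}}   # both labels of u map to color 1
proj_right = {("u", "v"): {1: 2, 2: 1}}  # only label 2 of v maps to color 1
total_labeling = {"u": 1, "v": 2}
bad_labeling = {"u": 1, "v": 1}
```

A total labeling corresponds to a fraction of 1; in a NO instance of the promise problem, no labeling exceeds the fraction ϵ.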
In Hardness of condition Steiner problems, we use Label Cover to show (2 − ϵ)-hardness for 2-CSN and 2-DCSN; that is, when there are only two demands. To prove our main result however, we will actually need a generalization of Label Cover to partite hypergraphs, called k-Partite Hypergraph Label Cover. Out of space considerations we defer the statement of this problem and its gap hardness to Proof of inapproximability for general C and k, where the (2 − ϵ)-hardness result is generalized to show (C − ϵ)-hardness and (k − ϵ)-hardness for general number of conditions C and demands k.

Overview of the reduction
Here we outline our strategy for reducing Label Cover to the condition Steiner problems. First, we reduce to the CSN problem restricted to having only C = 2 conditions and k = 2 demands; we call this problem 2-CSN. The directed problem 2-DCSN is defined analogously. Later, we obtain similar hardness for CSN with more conditions or demands by using the same ideas, but reducing from k-Partite Hypergraph Label Cover. Consider the nodes u 1 , . . . , u |U | on the "left" side of the LC instance. We build, for each u i , a gadget (which is a small sub-graph in the Steiner instance) consisting of multiple parallel directed paths from a source to a sink-one path for each possible label for u i . We then chain together these gadgets, so that the sink of u 1 's gadget is the source of u 2 's gadget, and so forth. Finally we create a connectivity demand from the source of u 1 's gadget to the sink of u |U | 's gadget, so that a solution to the Steiner instance must have a path from u 1 's gadget, through all the other gadgets, and finally ending at u |U | 's gadget. This path, depending on which of the parallel paths it takes through each gadget, induces a labeling of the left side of the Label Cover instance. We build an analogous chain of gadgets for the nodes on the right side of the Label Cover instance.
The last piece of the construction is to ensure that the Steiner instance has a low-cost solution if and only if the Label Cover instance has a consistent labeling. This is accomplished by setting all the u i gadgets to exist only at condition 1 (i.e. in frame G 1 ), setting the v j gadgets to exist only in G 2 , and then merging certain edges from the u i -gadgets with edges from the v j -gadgets, replacing them with a single, shared edge that exists in both frames. Intuitively, the edges we merge are from paths that correspond to labels that satisfy the Label Cover edge constraints. The result is that a YES instance of Label Cover (i.e. one with a total labeling) will enable a high degree of overlap between paths in the Steiner instance, so that there is a very low-cost solution. On the other hand, a NO instance of LC will not result in much overlap between the Steiner gadgets, so every solution will be costly.
Let us define some of the building blocks of the reduction we just sketched:
• A simple strand is a directed path of the form $b_1 \to c_1 \to c_2 \to b_2$.
• In a simple strand, we say that $(c_1, c_2)$ is the contact edge. Contact edges have weight 1; all other edges in our construction have zero weight.
• A bundle is a graph gadget consisting of a source node $b_1$, sink node $b_2$, and parallel, disjoint strands from $b_1$ to $b_2$.
• A chain of bundles is a sequence of bundles, with the sink of one bundle serving as the source of another.
• More generally, a strand can be made more complicated, by replacing a contact edge with another bundle (or even a chain of them). In this way, bundles can be nested, as shown in Fig. 2.
• We can merge two or more simple strands from different bundles by setting their contact edges to be the same edge, and making that edge existent at the union of all conditions when the original edges existed (Fig. 2).
Before formally giving the reduction, we illustrate a simple example of its construction.
Example 1 Consider a toy Label Cover instance whose bipartite graph is a single edge, label set is Σ = {1, 2}, color set is C = {1, 2}, and projection functions are shown: Our reduction outputs this corresponding 2-CSN instance: $G_1$ comprises the set of blue edges; $G_2$ is green. The demands are $(u^S_1, u^S_2, 1)$ and $(v^S_1, v^S_2, 2)$. For the Label Cover node u, $G_1$ (the blue sub-graph) consists of two strands, one for each possible label. For the Label Cover node v, $G_2$ (green sub-graph) consists of one simple strand for the label '1', and a bundle for label '2', which branches out into two simple strands, one for each agreeing labeling of u. Finally, strands (more precisely, their contact edges) whose labels map to the same color are merged.
The input is a YES instance of Label Cover whose optimal labelings (u gets either label 1 or 2, v gets label 2) correspond to 2-CSN solutions of cost 1 (both G 1 and G 2 contain the (u, 1, v, 2)-path, and both contain the (u, 2, v, 2)-path). If this were a NO instance and edge e could not be satisfied, then the resulting 2-CSN subgraphs G 1 and G 2 would have no overlap.

Inapproximability for two demands
We now formalize the reduction in the case of two conditions and two demands; later, we extend this to general C and k.
Proof Fix any desired ϵ > 0. We describe a reduction from Label Cover (LC) with any parameter ε < ϵ (that is, in the case of a NO instance, no labeling satisfies more than an ε-fraction of edges) to 2-DCSN with an acyclic graph. Given the LC instance (G = (U, V, E), Σ, Ψ), construct a 2-DCSN instance ($\mathcal{G} = (G_1, G_2)$, along with two connectivity demands) as follows. Create nodes $u^S_1, \ldots, u^S_{|U|+1}$ and $v^S_1, \ldots, v^S_{|V|+1}$. Let there be a bundle from each $u^S_i$ to $u^S_{i+1}$; we call this the $u_i$-bundle, since a choice of path from $u^S_i$ to $u^S_{i+1}$ in $\mathcal{G}$ will indicate a labeling of $u_i$ in G. The $u_i$-bundle has a strand for each possible label ℓ ∈ Σ. Each of these ℓ-strands consists of a chain of bundles, one for each edge $(u_i, v) \in E$. Finally, each such $(u_i, \ell, v)$-bundle has a simple strand for each label r ∈ Σ such that $\pi_{u_i}^{(u_i,v)}(\ell) = \pi_{v}^{(u_i,v)}(r)$; we call this the $(u_i, \ell, v, r)$-path. In other words, there is ultimately a simple strand for each possible labeling of $u_i$'s neighbor v such that the two nodes are in agreement under their mutual edge constraint. If there are no such consistent labels r, then the $(u_i, \ell, v)$-bundle consists of just one simple strand, which is not associated with any r. Note that every minimal $u^S_1 \to u^S_{|U|+1}$ path (that is, one that proceeds from one bundle to the next) has total weight exactly |E|.
Similarly, create a $v_j$-bundle from each $v^S_j$ to $v^S_{j+1}$, whose r-strands (for r ∈ Σ) are each a chain of bundles, one for each $(u, v_j) \in E$. Each $(u, r, v_j)$-bundle has a $(u, \ell, v_j, r)$-path for each agreeing labeling ℓ of the neighbor u, or a simple strand if there are no such labelings.
Set all the edges in the $u_i$-bundles to exist in $G_1$ only. Similarly, the $v_j$-bundles exist solely in $G_2$. Now, for each $(u, \ell, v, r)$-path that appears in both a u-bundle and a v-bundle, merge the two copies, so that their common contact edge exists in both $G_1$ and $G_2$. The two connectivity demands are $(u^S_1, u^S_{|U|+1}, 1)$ and $(v^S_1, v^S_{|V|+1}, 2)$.
We now analyze the reduction. The main idea is that any $u^S_i \to u^S_{i+1}$ path induces a labeling of $u_i$; thus the demand $(u^S_1, u^S_{|U|+1}, 1)$ ensures that any 2-DCSN solution indicates a labeling of all of U. Similarly, $(v^S_1, v^S_{|V|+1}, 2)$ forces an induced labeling of V. In the case of a YES instance of Label Cover, these two connectivity demands can be satisfied by taking two paths with a large amount of overlap, resulting in a low-cost 2-DCSN solution. In contrast, when we start with a NO instance of Label Cover, any two paths we can choose to satisfy the 2-DCSN demands will be almost completely disjoint, resulting in a costly solution. We now fill in the details.
Suppose the Label Cover instance is a YES instance, so that there exists a labeling $\ell^*_u$ to each u ∈ U, and $r^*_v$ to each v ∈ V, such that every edge (u, v) ∈ E is satisfied. The following is an optimal solution $H^*$ to the constructed 2-DCSN instance:
• To satisfy the demand at condition 1, for each u-bundle, take a path through the $\ell^*_u$-strand. In particular for each $(u, \ell^*_u, v)$-bundle in that strand, traverse the $(u, \ell^*_u, v, r^*_v)$-path.
• To satisfy the demand at condition 2, for each v-bundle, take a path through the $r^*_v$-strand. In particular for each $(u, r^*_v, v)$-bundle in that strand, traverse the $(u, \ell^*_u, v, r^*_v)$-path.
In tallying the total edge cost, $H^* \cap G_1$ (i.e. the subgraph at condition 1) incurs a cost of |E|, since one contact edge in $\mathcal{G}$ is encountered for each edge in G. $H^* \cap G_2$ accounts for no additional cost, since all contact edges correspond to a label which agrees with some neighbor's label, and hence were merged with the agreeing contact edge in $H^* \cap G_1$. Clearly a solution of cost |E| is the best possible, since every $u^S_1 \to u^S_{|U|+1}$ path in $G_1$ (and every $v^S_1 \to v^S_{|V|+1}$ path in $G_2$) contains at least |E| contact edges.
Conversely suppose we started with a NO instance of Label Cover, so that for any labeling ℓ * u to u and r * v to v, for at least (1 − ε)|E| of the edges (u, v) ∈ E , we have . By definition, any solution to the constructed 2-DCSN instance contains a simple u S 1 → u S |U |+1 path P 1 ∈ G 1 and a simple v S 1 → v S |V |+1 path P 2 ∈ G 2 . P 1 alone incurs a cost of exactly |E|, since one contact edge in G is traversed for each edge in G. However, P 1 and P 2 share at most ε|E| contact edges (otherwise, by the merging process, this implies that more than ε|E| edges could be consistently labeled, which is a contradiction). Thus the solution has a total cost of at least (2 − ε)|E|.
It is thus NP-hard to distinguish between an instance with a solution of cost |E| and an instance for which every solution has cost at least (2 − ε)|E| . Thus a polynomial-time algorithm for 2-DCSN with approximation ratio 2 − ϵ can be used to decide Label Cover (with parameter ε < ϵ ) by running it on the output of the aforementioned reduction. If the estimated objective value is at most (2 − ϵ)|E| (and thus strictly less than (2 − ε)|E| ), output YES; otherwise output NO. In other words, 2-DCSN is NP-hard to approximate to within a factor of 2 − ϵ.
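The gap argument can be summarized in one chain of inequalities. The following is a sketch only: ALG (the cost returned by the hypothetical (2 − ϵ)-approximation algorithm) and OPT (the optimum of the constructed instance) are names introduced here for illustration, and the Label Cover parameter is assumed to be chosen with ε < ϵ.

```latex
\begin{align*}
\text{YES instance:}\quad & \mathrm{ALG} \le (2-\epsilon)\,\mathrm{OPT} = (2-\epsilon)\,|E| < (2-\varepsilon)\,|E|,\\
\text{NO instance:}\quad  & \mathrm{ALG} \ge \mathrm{OPT} \ge (2-\varepsilon)\,|E|.
\end{align*}
```

Comparing ALG against the threshold (2 − ε)|E| therefore separates the two cases.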
To complete the proof, observe that the underlying directed graph we constructed is acyclic, as every edge points "to the right" as in Example 1. Hence 2-DCSN is NP-hard to approximate to within a factor of 2 − ϵ for every ϵ > 0 , even on acyclic graphs. Finally, note that the same analysis holds for 2-CSN, by simply making every edge undirected; however, in this case the graph is clearly not acyclic.

Inapproximability for general C and k
Theorem 1 (Main Theorem) CSN and DCSN are NP-hard to approximate to a factor of C − ϵ as well as k − ϵ for every fixed k ≥ 2 and every constant ϵ > 0 . For DCSN, this holds even when the underlying graph is acyclic.
Proof We perform a reduction from k-Partite Hypergraph Label Cover, a generalization of Label Cover to hypergraphs, to CSN, or DCSN with an acyclic graph. Using the same ideas as in the C = k = 2 case, we design k demands composed of parallel paths corresponding to labelings, and merge edges so that a good global labeling corresponds to a large overlap between those paths. The full proof is deferred to the section “Proof of inapproximability for general C and k”.
Note that a k-approximation algorithm is to simply take the union of a shortest a i → b i path in G c i over all k demands (a i , b i , c i ), since each such path costs at most the optimum. Thus by Theorem 1, essentially no better approximation is possible in terms of k alone. In contrast, most classic Steiner problems have good approximation algorithms [21,22,24,25], or are even exactly solvable for constant k [20].
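The shortest-path baseline behind the k-approximation can be sketched as follows. This is a minimal illustration for the directed case, not the authors' implementation: the function names, the dictionary-based graph encoding, and the per-condition edge sets are all assumptions made for this example, and edge weights are assumed non-negative.

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra on a directed edge dict {(u, v): weight}.

    Returns the list of edges on one shortest source->target path,
    or None if the target is unreachable. Weights must be non-negative.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = []
    node = target
    while node != source:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]


def union_of_shortest_paths(weights, condition_edges, demands):
    """Hypothetical baseline: satisfy each demand (a, b, c) separately.

    weights: {(u, v): w} on the underlying graph.
    condition_edges: {c: set of edges present in G_c}.
    Returns the chosen edge set and its total weight (shared edges paid once).
    """
    solution = set()
    for a, b, c in demands:
        edges_c = {e: weights[e] for e in condition_edges[c]}
        path = shortest_path(edges_c, a, b)
        if path is None:
            raise ValueError(f"demand {(a, b, c)} cannot be satisfied")
        solution.update(path)
    return solution, sum(weights[e] for e in solution)
```

Each individual path costs at most the optimum, so the union of k paths costs at most k times the optimum; for the undirected problem one would add both orientations of every edge under the same encoding.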

Inapproximability for Steiner variants
We take advantage of our previous hardness of approximation results in Theorem 1 and show, via a series of reductions, that CSP, CST, and CPCST are also hard to approximate.

Theorem 2 Condition Shortest Path, Directed Condition Shortest Path, Condition Steiner Tree, and Condition Prize-Collecting Steiner Tree are all NP-hard to approximate to a factor of C − ϵ for every fixed C ≥ 2 and ϵ > 0.
Proof We first reduce from CSN to CSP (and DCSN to DCSP). Suppose we are given an instance of CSN with graph sequence G = (G 1 , . . . , G C ) , underlying graph G = (V , E) , and demands D = {(a i , b i , c i ) : i ∈ [k]} . Initialize G ′ to G. Add to G ′ the new nodes a and b, which exist at all conditions G ′ i . For all e ∈ E and i ∈ [k] , if e ∈ G c i , then let e exist in G ′ i as well. For each (a i , b i , c i ) ∈ D, Lastly, the demands are
Given a solution H ′ ⊆ G ′ containing an a → b path at every condition i ∈ [k] , we can simply exclude nodes a, b, {x i } , and {y i } to obtain a solution H ⊆ G to the original instance, which contains an a i → b i path in G c i for all i ∈ [k] , and has the same cost. The converse is also true by including these nodes. Observe that essentially the same procedure shows that DCSN reduces to DCSP; simply ensure that the edges added by the reduction are directed rather than undirected.
Next, we reduce CSP to CST. Suppose we are given an instance of CSP with graph sequence G = (G 1 , . . . , G C ) , We build a new instance of CST as follows: . , X C ) . Set the graph sequence G ′ to G and the underlying graph G ′ to G. Take the set of terminals in each condition to be X i = {a, b} . We note that a solution H ′ ⊆ G ′ to the CST instance is trivially a solution to the CSP instance with the same cost, and vice versa.
Finally, we reduce CST to CPCST. We do this by making an appropriate assignment of the penalties p(v, c). Suppose we are given an instance of CST with graph sequence G = (G 1 , . . . , G C ) , underlying graph G = (V , E) , and terminal sets X = (X 1 , . . . , X C ) . We build a new instance of CPCST, . In particular, set the graph sequence G ′ to G and the underlying graph G ′ to G. Set p(v, c) as follows: p(v, c) = ∞ if v ∈ X c , and p(v, c) = 0 otherwise.
Consider any solution H ⊆ G to the original CST instance. Since H spans the terminals X 1 , . . . , X C (thus avoiding any infinite penalties), and since the non-terminal vertices have zero penalty, the overall cost of H remains the same in the constructed CPCST instance. Conversely, suppose we are given a solution H ′ ⊆ G ′ to the constructed CPCST instance. If the cost of H ′ is ∞ , then H ′ does not span all the X c 's simultaneously, and thus H ′ is not a feasible solution for the CST instance. On the other hand, if H ′ has finite cost, then H ′ is also a solution for the CST instance, with the same cost.
To summarize: in the first reduction from CSN to CSP, the number of demands, k, in the CSN instance is the same as the number of conditions, C, in the CSP instance; we conclude that CSP is NP-hard to approximate to a factor of C − ϵ for every fixed C ≥ 2 and ϵ > 0 . Since C remains the same in the two subsequent reductions, we also have that CST and CPCST are NP-hard to approximate to a factor of C − ϵ .
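The penalty assignment in the CST to CPCST reduction can be illustrated with a short sketch. It assumes, as the proof's discussion of infinite penalties and zero-cost non-terminals suggests, that a terminal of condition c gets an infinite penalty at c and every other vertex gets penalty zero; the function name and data layout are hypothetical.

```python
import math


def cpcst_penalties(vertices, terminal_sets):
    """Assumed penalty assignment for the CST -> CPCST reduction.

    vertices: iterable of vertex names.
    terminal_sets: {c: set of terminals X_c}.
    Returns {(v, c): penalty}, infinite for terminals of condition c
    and zero for all other vertices at that condition.
    """
    return {
        (v, c): (math.inf if v in terminals else 0.0)
        for c, terminals in terminal_sets.items()
        for v in vertices
    }
```

With this assignment, any finite-cost CPCST solution must connect every terminal in its own condition, which is what makes the two instances equivalent in cost.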

Monotonic special cases
In light of the strong lower bounds in the previous theorems, in this section we consider more tractable special cases of the condition Steiner problems. A natural restriction is that the changes over conditions are monotonic: for any of the condition Steiner problems, we require that for each e ∈ E and c ∈ [C] , if e ∈ G c , then e ∈ G c ′ for all c ′ ≥ c. That is, an edge, once present, remains present in all later conditions.
We now examine the effect of monotonicity on the complexity of the condition Steiner problems.
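As a concrete reading of the monotonicity property, the following sketch checks it for a sequence of edge sets. The function name and the list-of-sets encoding are assumptions made for this example.

```python
def is_monotonic(condition_edges):
    """Check that edges never disappear as the condition index grows.

    condition_edges: list of edge sets [E(G_1), ..., E(G_C)] in condition order.
    Comparing consecutive conditions suffices, because set inclusion is transitive.
    """
    return all(
        earlier <= later
        for earlier, later in zip(condition_edges, condition_edges[1:])
    )
```

For instance, a sequence in which each G c adds edges to G c−1 passes the check, while a sequence that drops an edge at any step fails it.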

Monotonicity in the undirected case
In the undirected case, we show that monotonicity has a simple effect: it makes CSN equivalent to the following well-studied problem:
Definition 9 (Priority Steiner Tree [31]) The input is a weighted undirected multigraph G = (V , E, w) , a priority level p(e) for each e ∈ E , and a set of k demands (a i , b i ) , each with priority p(a i , b i ) . The output is a minimum-weight forest F ⊆ G that contains, between each a i and b i , a path in which every edge e has priority p(e) ≤ p(a i , b i ).
Priority Steiner Tree was introduced by Charikar, Naor, and Schieber [31], who gave an O(log k)-approximation algorithm. Moreover, it cannot be approximated to within a factor of Ω(log log n) assuming NP ⊄ DTIME(n^{log log log n}) [32]. We now show that the same bounds apply to Monotonic CSN, by showing that the two problems are essentially equivalent from an approximation standpoint.  (a i , b i , p(a i , b i )) . If there are parallel multiedges, break up each such edge into two edges of half the original weight, joined by a new node. Given a solution H ⊆ G to this CSN instance, contracting any edges that were originally multiedges gives a Priority Steiner Tree solution of the same cost. This reduction also works in the opposite direction (in this case there are no multiedges), which shows the equivalence.
Furthermore, the O(log k) upper bound applies to CST. (We note that Monotonic CSP admits a trivial algorithm, namely, take the subgraph induced by running Dijkstra's algorithm on G 1 .)
Proof We now show a reduction from CST to CSN. Suppose we are given a CST instance on graphs G = (G 1 , . . . , G C ) and terminal sets X = (X 1 , . . . , X C ) .
Our CSN instance has precisely the same graphs, and has the following demands: for each terminal set X c , pick any terminal a ∈ X c and create a demand (a, b, c) for each b ∈ X c with b ≠ a. A solution to the original CST instance is a solution to the constructed CSN instance with the same cost, and vice versa; moreover, if the CST instance is monotonic, then so is the constructed CSN instance.
Observe that if the total number of CST terminals is k, then the number of constructed demands is k − C , and therefore an f(k)-approximation for CSN implies an f (k − C) ≤ f (k)-approximation for CST, as required.
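The demand construction in this reduction is short enough to write out. The sketch below assumes terminal sets are given as a dictionary keyed by condition and that terminals are sortable (so the choice of root is deterministic); both choices are illustrative only.

```python
def cst_to_csn_demands(terminal_sets):
    """Build CSN demands from CST terminal sets.

    terminal_sets: {c: set of terminals X_c}.
    For each condition, one terminal is picked as the root a and a demand
    (a, b, c) is created for every other terminal b in X_c.
    """
    demands = []
    for c, terminals in sorted(terminal_sets.items()):
        ordered = sorted(terminals)
        if not ordered:
            continue
        root = ordered[0]  # any terminal works; the smallest is an arbitrary choice
        demands.extend((root, b, c) for b in ordered[1:])
    return demands
```

With k terminals in total over C non-empty terminal sets, the function returns k − C demands, matching the count used in the argument above.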

Monotonicity in the directed case
In the directed case, we give an approximation-preserving reduction from a single-source special case of DCSN to the Directed Steiner Tree (DST) problem (in fact, we show that they are essentially equivalent from an approximation standpoint), then apply a known algorithm for DST. Recall the definition of Single-Source DCSN:
Definition 6 (Single-Source DCSN) This is the special case of DCSN in which the demands are precisely (a, b 1 , c 1 ), (a, b 2 , c 2 ), . . . , (a, b k , c k ) , for some root a ∈ V . We can assume that c 1 ≤ c 2 ≤ · · · ≤ c k .
For the remainder of this section, we refer to Monotonic Single-Source DCSN as simply DCSN. Towards proving the theorem, we now describe a reduction from DCSN to DST. Given a DCSN instance
Proof Let H ⊆ G be a DCSN solution having cost C * . For any edge (u, v) ∈ E(H) , define the earliest necessary condition of (u, v) to be the minimum c i such that removing (u, v) would cause H not to satisfy demand (a, b i , c i ) .

Claim 1
There exists a solution T ⊆ H that is a directed tree and has cost at most C * . Moreover, for every path P i in T from the root a to some target b i , as we traverse P i from a to b i , the earliest necessary conditions of the edges are non-decreasing.
Proof of Claim 1 Consider a partition of H into edge-disjoint sub-graphs H 1 , . . . , H k , where H i is the sub-graph whose edges have earliest necessary condition c i .
If there is a directed cycle or parallel paths in the first sub-graph H 1 , then there is an edge e ∈ E(H 1 ) whose removal does not cause H 1 to satisfy fewer demands at condition c 1 . Moreover, by monotonicity, removing e also does not cause H to satisfy fewer demands at any future conditions. Hence there exists a directed tree T 1 ⊆ H 1 such that T 1 ∪ ⋃_{i=2}^{k} H i has cost at most C * and still satisfies D . Now suppose by induction that for some j < k, ⋃_{i=1}^{j} T i is a tree such that ⋃_{i=1}^{j} T i ∪ ⋃_{i=j+1}^{k} H i has cost at most C * and satisfies D . Consider the partial solution ⋃_{i=1}^{j} T i ∪ H j+1 ; if this sub-graph is not a directed tree, then there must be an edge (u, v) ∈ E(H j+1 ) such that v has another in-edge in the sub-graph. However, by monotonicity, (u, v) does not help satisfy any new demands, as v is already reached by some other path from the root. Hence by removing all such redundant edges, we obtain T j+1 ⊆ H j+1 such that ⋃_{i=1}^{j+1} T i ∪ ⋃_{i=j+2}^{k} H i has cost at most C * and satisfies D , which completes the inductive step.
We conclude that T := ⋃_{i=1}^{k} T i ⊆ H is a tree of cost at most C * satisfying D . Observe also that by construction, as T is a tree that is iteratively constructed from T i ⊆ H i , T has the property that if we traverse any a → b i path, the earliest necessary conditions of the edges never decrease.
Now let T be the DCSN solution guaranteed to exist by Claim 1. Consider the sub-graph H ′ ⊆ G ′ formed by adding, for each (u, v) ∈ E(T ) , the edge (u^c , v^c ) ∈ E ′ where c is the earliest necessary condition of (u, v) in E(H) . In addition, for all ver-
To see that H ′ is a valid solution, consider any demand (a^1 , b_i^{c_i} ) . Recall that T has a unique a → b i path P i along which the earliest necessary conditions are non-decreasing. We added to H ′ each of these edges at the level corresponding to its earliest necessary condition; moreover, whenever there are adjacent edges (u, v), (v, x) ∈ P i with earliest necessary conditions c and c ′ ≥ c respectively, there exist in H ′ free edges path, which completes the proof.
Proof First note that any DST solution ought to be a tree; let T ′ ⊆ G ′ be such a solution of cost C. For each (u, v) ∈ G , T ′ might as well use at most one edge of the form (u^i , v^i ) , since if it uses more, it can be improved by using only the one with minimum i, then taking the free edges (v^i , v^{i+1} ) as needed. We create a DCSN solution T ⊆ G as follows: for each (u^i , v^i ) ∈ E(T ′ ) , add (u, v) to T . Since w(u, v) = w(u^i , v^i ) by design, we have cost(T ) ≤ cost(T ′ ) ≤ C . Finally, since each a^1 → b_i^{t_i} path in G ′ has a corresponding path in G by construction, T satisfies all the demands.
Lemma 3 follows from Lemma 4 and Lemma 5. Finally, we can obtain the main result of this subsection:
Theorem 4 Monotonic Single-Source DCSN has a polynomial-time O(k^ϵ)-approximation algorithm for every ϵ > 0 . It has no Ω(log^{2−ϵ} n)-approximation algorithm unless NP ⊆ ZPTIME(n^{polylog(n)}).
Proof The upper bound follows by composing the reduction (from Monotonic Single-Source DCSN to Directed Steiner Tree) with the algorithm of Charikar et al. [24] for Directed Steiner Tree, which achieves ratio O(k^ϵ) for every ϵ > 0 . More precisely, they give an i^2 (i − 1) k^{1/i} -approximation for any integer i ≥ 1 , in time O(n^i k^{2i} ) . The lower bound follows by composing the reduction (in the opposite direction) with a hardness result of Halperin and Krauthgamer [25], who show the same bound for Directed Steiner Tree. A quick note regarding the reduction in the opposite direction: Directed Steiner Tree is precisely a Monotonic Single-Source DCSN instance with exactly one condition.
In Explicit algorithm for Monotonic Single-Source DCSN, we show how to modify the algorithm of Charikar et al. to arrive at a simple, explicit algorithm for Monotonic Single-Source DCSN achieving the same guarantee.
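The layered graph used in the reduction to DST can be sketched as follows. The full construction is not spelled out in this text, so this is a reconstruction from the proof's references to per-condition edge copies, zero-weight "free" edges between consecutive copies of a vertex, a root at level 1, and targets at the level of their demand. The function name and the (vertex, level) encoding are assumptions.

```python
def dcsn_to_dst(weights, condition_edges, root, demands):
    """Assumed layered construction for Monotonic Single-Source DCSN -> DST.

    weights: {(u, v): w} on the underlying directed graph.
    condition_edges: list [E(G_1), ..., E(G_C)], assumed monotonic.
    demands: list of (root, b_i, c_i) with 1-based conditions.
    Returns (layered_weights, dst_root, dst_terminals), where layered vertices
    are pairs (v, c) standing for the copy of v at condition c.
    """
    num_conditions = len(condition_edges)
    vertices = {x for edge in weights for x in edge}
    layered = {}
    # One copy of every edge at each condition in which it exists.
    for c, edges in enumerate(condition_edges, start=1):
        for (u, v) in edges:
            layered[((u, c), (v, c))] = weights[(u, v)]
    # Free edges let a path move from level c to level c + 1 at no cost.
    for c in range(1, num_conditions):
        for v in vertices:
            layered[((v, c), (v, c + 1))] = 0.0
    terminals = [(b, c) for (_a, b, c) in demands]
    return layered, (root, 1), terminals
```

Any DST algorithm could then be run on the returned instance; mapping each chosen copy of an edge back to the original edge gives a candidate DCSN solution.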

Application to protein-protein interaction networks
Methods such as Directed Condition Steiner Network can be key in identifying underlying structure in biological processes. As a result, it is important to assess the runtime feasibility of solving for an optimal solution. We show via simulation on human protein-protein interaction networks that our algorithm on single-source instances is able to quickly and accurately infer maximum likelihood subgraphs for a certain biological process.

Building the protein-protein interaction network
We represent the human PPI network as a weighted directed graph, where proteins serve as nodes and interactions serve as edges. The network was formed by aggregating information from four sources of interaction data, NetPath [33], Phosphosite [34], HPRD [35], and InWeb [36], altogether covering 16222 nodes and 437888 edges. Edge directions are assigned where these annotations were available (primarily in Phosphosite and NetPath). The remaining edges are represented by two directed edges between the proteins involved. Edge weights were assigned by taking the negative logarithm of the associated confidence score, so that finding the optimal Steiner Network is the same as finding the most confident solution (assuming independence between edges). Confidence data was available for the largest of the data sets (InWeb). For HPRD edges that are not in InWeb, we used the minimum nonzero confidence value by default. For the smaller and highly curated datasets, Phosphosite and NetPath, we used the maximal confidence level.
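The link between edge weights and confidence can be made concrete with a small sketch. It assumes confidence scores lie in (0, 1] and are treated as independent probabilities; the function name is hypothetical.

```python
import math


def confidence_to_weight(confidence):
    """Map a confidence score in (0, 1] to a non-negative edge weight.

    Assumption: scores behave like independent probabilities, so the
    product of confidences corresponds to the sum of weights.
    """
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")
    return -math.log(confidence)
```

Under this mapping the total weight of a subgraph equals the negative logarithm of the product of its edge confidences, so minimizing total weight maximizes the joint confidence.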

Solving DCSN to optimality
Definition 6 (Single-Source DCSN) This is the special case of DCSN in which the demands are precisely (a, b 1 , c 1 ), (a, b 2 , c 2 ), . . . , (a, b k , c k ) , for some root a ∈ V . We can assume that c 1 ≤ c 2 ≤ · · · ≤ c k .
We can derive a natural integer linear program for the Single-Source Directed Condition Steiner Network in terms of network flows, with each demand being met by a flow from source to target. Each variable d uvc denotes the flow through edge (u, v) at condition c, if it exists; each variable d uv denotes whether (u, v) is ultimately in the chosen solution subgraph; k c denotes the number of demands at condition c . The first constraint ensures that if an edge is used at any condition, it is chosen as part of the solution. The second constraint enforces flow conservation, and hence that the demands are satisfied, at all nodes and all conditions. We note that DCSN easily reduces to DCSP, as outlined in Theorem 2. Moreover, DCSP is a special case of Single-Source DCSN. Therefore, the integer linear program defined above can be applied to any DCSN instance with a transformation of the instance to DCSP (Fig. 3).
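The program itself does not appear in this text. One formulation consistent with the description above is sketched below; it is a reconstruction under assumptions, not necessarily the authors' exact program. The symbol t_c(v), the number of demands at condition c whose target is v, is notation introduced here, and the root a is assumed not to be a target.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{(u,v)\in E} w(u,v)\, d_{uv} \\
\text{subject to}\quad
& k_c\, d_{uv} \ge d_{uvc}
  && \forall c \in [C],\ (u,v) \in E(G_c), \\
& \sum_{(u,v)\in E(G_c)} d_{uvc} \;-\; \sum_{(v,x)\in E(G_c)} d_{vxc}
  = \begin{cases} -k_c & \text{if } v = a, \\ t_c(v) & \text{if } v \neq a \end{cases}
  && \forall c \in [C],\ v \in V, \\
& d_{uvc} \in \{0, 1, \dots, k_c\}, \qquad d_{uv} \in \{0, 1\}.
\end{align*}
```

Here k_c = \sum_v t_c(v), so the root sends out exactly one unit of flow per demand at condition c and each target absorbs one unit per demand that names it.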

Performance analysis of integer linear programming
Given the protein-protein interaction network G, we sample an instance of the node-variant Single-Source DCSN as follows:
• Instantiate a source node a.
• Independently sample β nodes reachable from a, for each of the C conditions, giving us {b 1,1 , . . . , b β,C }.
• For each node v ∈ V , include v ∈ V c if v lies on the shortest path from a to one of {b 1,c , . . . , b β,c }.
• For all other nodes v ∈ V , for all c, include v ∈ V c with probability p.
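The sampling procedure above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names and graph encoding are hypothetical, targets are assumed to be drawn without replacement within a condition, and shortest paths are taken from a single Dijkstra tree with non-negative weights.

```python
import heapq
import random


def shortest_path_tree(weights, source):
    """Dijkstra predecessor map over a directed edge dict {(u, v): w}."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    return prev


def sample_instance(weights, source, beta, num_conditions, p, seed=0):
    """Sample targets and active node sets V_c for each condition c."""
    rng = random.Random(seed)
    prev = shortest_path_tree(weights, source)
    reachable = sorted(prev)  # nodes other than the source reachable from it
    all_nodes = sorted({x for edge in weights for x in edge})
    targets, active = {}, {}
    for c in range(1, num_conditions + 1):
        # Assumption: sample without replacement, capped by availability.
        targets[c] = rng.sample(reachable, min(beta, len(reachable)))
        on_paths = {source}
        for b in targets[c]:
            node = b
            while node != source:  # walk the shortest path back to the source
                on_paths.add(node)
                node = prev[node]
        extra = {v for v in all_nodes if v not in on_paths and rng.random() < p}
        active[c] = on_paths | extra
    return targets, active
```

Because every node on a sampled shortest path is kept active, each generated demand is guaranteed to be satisfiable in its own condition.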
Using a workstation running an Intel Xeon E5-2690 processor and 250 GB of RAM, optimal solutions to instances of modest size (generated using the procedure just described) were within reach (Table 2). We notice that our primary runtime constraint comes from C, the number of conditions. In practice, the number of conditions does not exceed 100.
In addition, we tested our DCSN ILP formulation against a simple algorithm that optimizes over each demand independently via shortest path. Theoretically, the shortest path method can perform up to k times worse than DCSN. We note that having zero-weight edges complicates the comparison of the algorithms' performance on real data, because a large and a small network can have the same weight. We therefore wanted to also take into account the size of the returned networks, and to do so we added a constant weight to every edge. Testing over a sample set of instances generated with parameters β = 100 , C = 10 , p = 0.25 , we found that the shortest path method returns a solution that is on average 1.07 times more costly.
These results offer preliminary evidence that our model can be used to formulate real-world biological problems and solve them to optimality with practical runtime.

Conclusion and discussion
In this paper we introduced the Condition Steiner Network (CSN) problem and its directed variant, in which the goal is to find a minimal subgraph satisfying a set of k condition-sensitive connectivity demands. We show, in contrast to known results for traditional Steiner problems, that this problem is NP-hard to approximate to a factor of C − ϵ , as well as k − ϵ , for every C, k ≥ 2 and ϵ > 0 . We then explored a special case, in which the conditions/graphs satisfy a monotonicity property. For such instances we proposed algorithms significantly beating the pessimistic lower bound for the general problem; this was accomplished by reducing the problem to certain traditional Steiner problems. Lastly, we developed and