Finding low-conductance sets with dense interactions (FLCD) for better protein complex prediction

Background: Intuitively, proteins in the same protein complex should interact heavily with each other but rarely with other proteins in protein-protein interaction (PPI) networks. Surprisingly, many existing computational algorithms do not directly detect protein complexes based on both of these topological properties. Most of them rely on mathematical definitions of either “modularity” or “conductance”, each of which has its own limitation: modularity has an inherent resolution problem that causes small protein complexes to be ignored, and conductance characterizes the separability of complexes but fails to capture the interaction density within complexes.
Results: In this paper, we propose a two-step algorithm, FLCD (Finding Low-Conductance sets with Dense interactions), to predict overlapping protein complexes with the desired topological structure, which is densely connected inside and well separated from the rest of the network. First, FLCD detects well-separated subnetworks by approximating a potential low-conductance set through a personalized PageRank vector from a protein and then solving a mixed integer programming (MIP) problem to find the minimum-conductance set within the identified low-conductance set. In the second step, the densely connected parts of those subnetworks are identified as the protein complexes by solving another MIP problem that aims to find the dense subnetwork in the minimum-conductance set.
Conclusion: Experiments on four large-scale yeast PPI networks from different public databases demonstrate that the complexes predicted by FLCD correspond better to the yeast protein complex gold standards than those predicted by three other state-of-the-art algorithms (ClusterONE, LinkComm, and SR-MCL). Additionally, the results of FLCD show higher biological relevance with respect to Gene Ontology (GO) terms in GO enrichment analysis.

PPI networks. Many algorithms have been developed and applied for this purpose of detecting protein complexes.
These existing algorithms can be grouped into three categories. The first category includes the algorithms that mimic Markovian random walk on graphs, pioneered by MCL [7]. MCL does not have explicit mathematical definitions for the desired properties of subnetworks to detect as protein complexes. Similar to random walk, it iteratively implements "Expand" and "Inflation" operations to generate non-overlapping complexes. R-MCL [8] and SR-MCL [9] are improved versions of MCL. R-MCL penalizes the large complexes at each iteration in order to obtain more size-balanced complexes with a similar number of nodes within them. SR-MCL executes R-MCL many times to yield overlapping complexes. All those algorithms have shown good empirical performance, despite the mystery of parameter tuning and the lack of theoretic understanding of their working mechanisms.
Algorithms in the second category do not directly predict complexes according to the topological structure of subnetworks but resemble traditional clustering methods based on derived similarity measures between nodes or edges. For example, MCODE [1], CFinder [10], and RRW [11] grow complexes from single nodes by iteratively adding similar nodes in terms of different similarity criteria that help form local dense subnetworks. However, they only concentrate on the internal connectivity of the subnetworks and neglect the connectivity between the subnetworks and the rest of the networks. LinkComm [12] represents networks with edge graphs, whose nodes are interactions and edges reflect the similarity between interactions, and derives potential complexes by hierarchical clustering to partition the edge graphs.
Algorithms in the third category detect complexes based on explicit topological definitions of protein complexes. For example, modularity [13] and conductance [6,14] are two widely used definitions. Algorithms based on modularity [15] aim to detect subnetworks that have more internal connections than expected. Algorithms based on finding low-conductance sets, such as ClusterONE [6], focus on the separability of the subnetworks, which can be quantified by the ratio between the external connections of a subnetwork and the total number of interactions of the proteins within the subnetwork. However, these methods have their own limitations. Modularity-based methods have an inherent resolution problem [16], which causes small protein complexes to be overlooked. Algorithms based on conductance minimization [6,17] consider the relationship between the internal connections and the external connections of subnetworks, but neglect the density of the interactions within the subnetworks.
In this paper, we propose a two-step algorithm, FLCD (Finding Low-Conductance sets with Dense interactions), to detect protein complexes that have dense interactions inside and sparse interactions outside in a given PPI network. FLCD explicitly handles both the internal and external connectivity of protein complexes in two steps. FLCD first identifies a low-conductance set around a protein, which is locally well separated from the rest of the network. Then a densely connected subnetwork within the low-conductance set is detected based on the definition of the edge density of a subnetwork proposed in [18]. We compare FLCD with three state-of-the-art overlapping complex prediction algorithms: ClusterONE [6], LinkComm [12], and SR-MCL [9]. Experimental results on four yeast PPI networks from different publicly accessible databases demonstrate that FLCD outperforms all competing algorithms in biological significance, as measured by yeast protein complex gold standards and Gene Ontology (GO) term annotations [19].

Results and discussion
We first introduce the implementation details of the algorithms that we use for comparison; the PPI networks, the reference protein complex datasets that serve as our gold standards, and the GO terms we use for evaluation; and the criteria for the performance comparison. To demonstrate the robust performance of FLCD, we then compare its predicted protein complexes with those of three selected state-of-the-art protein complex prediction algorithms, based on two gold standard protein complex datasets and four public yeast PPI networks. In addition, we apply GO enrichment analysis to the entire set of complexes detected by each competing algorithm. Finally, we illustrate the differences between the protein complexes predicted by the competing algorithms for specific reference complexes to further demonstrate the advantage of FLCD.

Algorithms, data, and evaluation metrics

Algorithms
We compare our FLCD algorithm with three other state-of-the-art overlapping complex prediction algorithms: ClusterONE [6], LinkComm [12], and SR-MCL [9]. The Java implementation of ClusterONE does not require any parameter tuning. For LinkComm, we set the tuning parameter t (the threshold at which the dendrogram is cut in hierarchical clustering) to 0.2, which achieves the best performance empirically in our experiments. For SR-MCL, we set the inflation parameter I = 3 and leave the other parameters at their default settings, since these yield the best results in our experiments. We set the only parameter k of FLCD, the size of the local neighborhood obtained from the personalized PageRank computation, to 20.

Data
We take four yeast PPI networks for performance evaluation: SceDIP, SceBG, SceIntAct, and SceMINT, extracted respectively from the Database of Interacting Proteins (DIP) [2], the Biological General Repository for Interaction Datasets (BioGRID) [3], the IntAct Molecular Interaction Database (IntAct) [4], and the Molecular INTeraction database (MINT) [5]. We consider only protein-protein interactions and therefore remove all genetic interactions from SceBG. We download the protein complex gold standards from the supplementary data in [6], which are obtained from the Saccharomyces Genome Database (SGD) [20] and the Munich Information Center for Protein Sequences (MIPS) [21] databases. For each PPI network, we remove reference protein complexes that contain fewer than three proteins or for which half of their proteins are not in the network. Detailed information on the four PPI networks and the gold standard reference complex datasets is provided in Table 1.
Due to the possible incompleteness of the reference protein complexes, we further examine the biological relevance of every predicted complex by GO enrichment analysis. We download the mappings of yeast genes and proteins to GO terms according to [20] (version 20150411).

Evaluation metrics for protein complex prediction
For the protein complex prediction, we assess the performance of all competing algorithms by a composite score consisting of three quality measures: F-measure [9,14]; the geometric accuracy (Acc) score [14]; and the maximum matching ratio (MMR) [6]. For fair comparison, we remove predicted complexes of two or fewer proteins by all competing algorithms.
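For a gold standard reference protein complex set $C = \{c_1, c_2, \ldots, c_n\}$ and a set of predicted complexes $S = \{s_1, s_2, \ldots, s_m\}$, the F-measure is defined as the harmonic mean of precision and recall, which are defined as follows:
$$\text{precision} = \frac{|N_{cs}|}{|S|}, \qquad \text{recall} = \frac{|N_{cp}|}{|C|}, \tag{1}$$
in which $N_{cs} = \{s_i \in S \mid NA(c_j, s_i) \geq 0.25, \exists c_j \in C\}$ is the set of predicted complexes that match one or more reference protein complexes, and $|N_{cs}|$ is the size of the set $N_{cs}$. Similarly, $N_{cp} = \{c_i \in C \mid NA(c_i, s_j) \geq 0.25, \exists s_j \in S\}$ is the set of reference complexes that are matched by one or more predicted complexes [9,22], where $NA(c_i, s_j) = \frac{|c_i \cap s_j|^2}{|c_i| \times |s_j|}$ is called the neighborhood affinity. Finally, the F-measure is
$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \tag{2}$$
The geometric accuracy (Acc) score is the geometric mean of two other measures: the cluster-wise sensitivity (Sn) and the cluster-wise positive predictive value (PPV) [6]. Given m predicted and n reference complexes, let $t_{ij}$ denote the number of proteins that exist in both predicted complex $s_i$ and reference complex $c_j$, and let $w_j$ represent the number of proteins in reference complex $c_j$. Then Sn and PPV can be computed as
$$Sn = \frac{\sum_{j=1}^{n} \max_{i} t_{ij}}{\sum_{j=1}^{n} w_j}, \qquad PPV = \frac{\sum_{i=1}^{m} \max_{j} t_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} t_{ij}}. \tag{3}$$
The Acc score provides a balanced measure of Sn and PPV: $Acc = \sqrt{Sn \times PPV}$. The maximum matching ratio (MMR) is the ratio of the weight of the maximum-weight matching between predicted and reference complexes to the size of the reference set.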
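The F-measure and Acc definitions above can be checked on a toy example. The following is a minimal sketch, assuming complexes are represented as Python sets of protein identifiers; the function names and the toy data are hypothetical and are not taken from the paper's implementation. MMR is omitted because it additionally requires a maximum-weight bipartite matching.

```python
from math import sqrt

def neighborhood_affinity(c, s):
    """NA(c, s) = |c ∩ s|^2 / (|c| * |s|) for two protein sets."""
    overlap = len(c & s)
    return overlap * overlap / (len(c) * len(s))

def f_measure(reference, predicted, threshold=0.25):
    """Precision, recall and F-measure under the NA >= threshold matching rule."""
    matched_pred = sum(
        any(neighborhood_affinity(c, s) >= threshold for c in reference)
        for s in predicted
    )
    matched_ref = sum(
        any(neighborhood_affinity(c, s) >= threshold for s in predicted)
        for c in reference
    )
    precision = matched_pred / len(predicted)
    recall = matched_ref / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def geometric_accuracy(reference, predicted):
    """Sn, PPV and Acc from the overlap counts t[i][j] = |s_i ∩ c_j|."""
    t = [[len(s & c) for c in reference] for s in predicted]
    sn_num = sum(max(row[j] for row in t) for j in range(len(reference)))
    sn = sn_num / sum(len(c) for c in reference)
    total = sum(sum(row) for row in t)
    ppv = sum(max(row) for row in t) / total if total else 0.0
    return sn, ppv, sqrt(sn * ppv)

# Toy data (hypothetical): two reference complexes, three predictions.
reference = [{"A", "B", "C", "D"}, {"E", "F", "G"}]
predicted = [{"A", "B", "C"}, {"E", "F", "G", "H"}, {"X", "Y", "Z"}]

precision, recall, f1 = f_measure(reference, predicted)
sn, ppv, acc = geometric_accuracy(reference, predicted)
print(precision, recall, f1, sn, ppv, acc)
```

In this toy case the third prediction matches no reference complex, which lowers precision but leaves recall untouched.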

GO enrichment analysis
Suppose that a given PPI network has N proteins, of which M are annotated with a given GO term, and that a predicted complex has n proteins, of which m are annotated with the same GO term. The p-value of the complex being enriched with that GO term can be calculated, as similarly done in [23], by
$$p\text{-value} = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}}. \tag{4}$$
For each predicted complex, we choose the lowest p-value over all its enriched GO terms as its final p-value. A GO term is considered statistically significantly enriched when the p-value of any complex corresponding to this GO term is lower than 1e−3.
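As an illustration of the enrichment test, the sketch below computes the upper-tail probability of a hypergeometric distribution, which is the usual form of a GO enrichment p-value with the quantities N, M, n, and m defined above. The function name is hypothetical, and the hypergeometric form is an assumption about the test used in [23].

```python
from math import comb

def enrichment_p_value(N, M, n, m):
    """P(X >= m) when X counts annotated proteins among n drawn from N,
    of which M are annotated (hypergeometric upper tail)."""
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)

# Toy numbers (hypothetical): 10 proteins, 4 annotated, complex of size 3
# containing 2 annotated proteins.
p = enrichment_p_value(N=10, M=4, n=3, m=2)
print(p)
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that appear when binomial coefficients are multiplied as floats for large networks.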

Comparison on protein complex prediction
We apply all competing algorithms to search for potential protein complexes in four yeast PPI networks and compare them in terms of the composite score, consisting of F-measure, Acc score and MMR based on both the SGD and MIPS reference protein complex datasets.
We note that the sizes and numbers of detected complexes affect the scores of the metrics that we have employed. However, in the context of complex prediction, there is no universal gold-standard metric. Hence, we apply the three aforementioned metrics, which have been commonly adopted in related work [6,9]. We also note that the average size of the complexes generated by FLCD in our experiments ranges from 6 to 8 across the four networks under study. These sizes are comparable to those of the complexes detected by the other algorithms: the average size ranges from 5 to 6 for LinkComm, from 7 to 9 for ClusterONE, and from 8 to 10 for SR-MCL. Furthermore, the total numbers of predicted complexes yielded by FLCD, LinkComm, and SR-MCL are much larger than that of ClusterONE. The reason is that the post-processing procedure of ClusterONE filters out complexes with lower scores, whereas FLCD and LinkComm output all complexes without filtering.
As shown in Figs. 1 and 2, FLCD clearly outperforms other state-of-the-art algorithms for all four networks on both SGD and MIPS reference datasets. Therefore, the complexes detected by FLCD have the best correspondence with the reference datasets. The detailed evaluation scores in Figs. 1 and 2 are displayed in Tables 2 and 3, respectively.
When we take the SGD reference dataset as our gold standard, Table 2 shows that FLCD consistently achieves the best MMR scores among all competing algorithms, which we attribute to FLCD being the only algorithm that captures the desired network structure of protein complexes. In the table, we also compare the F-measure and the precision and recall scores that are used to compute it. We observe that for all four PPI networks, FLCD predicts the largest number of matched reference protein complexes and therefore attains the best recall scores. With respect to precision, FLCD is the best for SceMINT but ClusterONE performs the best for the rest. However, since the post-processing step in ClusterONE keeps only the dense complexes, ClusterONE has low coverage. Based on the precision and recall scores, FLCD attains the best F-measures for the SceDIP and SceMINT PPI networks and ClusterONE obtains the best F-measures for the SceBG and SceIntAct PPI networks. In addition to MMR and F-measure, we compare the cluster-wise sensitivity (Sn), the cluster-wise positive predictive value (PPV), and the Acc score. FLCD has the best Acc scores for SceBG and SceIntAct. LinkComm obtains the best Acc scores for SceDIP and SceMINT, since LinkComm detects several large complexes and many small complexes, which favors both the Sn and PPV scores [6]. We also compare the coverage of the competing algorithms: SR-MCL has the largest coverage, and the coverage of FLCD is competitive with that of SR-MCL. Here, the coverage is defined as the number of proteins covered by all predicted complexes, which is typically used to evaluate whether complex prediction algorithms can help comprehensively predict functions for all the proteins in a given network.
For the MIPS reference dataset, we observe a similar trend in the evaluation scores in Table 3. FLCD finds the largest number of matched reference complexes in MIPS and attains the best recall scores, F-measures, and MMR scores for all four PPI networks. The Acc scores of FLCD are competitive with those of LinkComm, which achieves the best Acc scores for all four yeast PPI networks. The number of proteins covered by FLCD is competitive with that of SR-MCL, which covers the largest number of proteins in all four yeast PPI networks. In terms of overall performance, as represented by the composite score, FLCD is superior to the other competing algorithms, as shown in Fig. 2.
In summary, considering the composite score based on three metrics, our FLCD outperforms the other algorithms. To further validate all competing algorithms, we perform GO enrichment analysis in the next section to see whether all predicted complexes by different algorithms have significant biological meaning.

Comparison on GO enrichment analysis
We perform GO enrichment analysis for all protein complexes predicted by the competing algorithms. Table 4 reports the percentages of the predicted protein complexes that are significantly enriched with at least one GO term and the total number of GO terms that are enriched in the predicted complexes. FLCD achieves the best percentages of enriched predicted protein complexes in the SceDIP and SceIntAct PPI networks. ClusterONE obtains the best percentages for the SceBG and SceMINT PPI networks, but with a smaller number of GO terms enriched in the detected complexes, because ClusterONE may remove meaningful functional modules in its post-processing step. Furthermore, the protein complexes detected by FLCD are significantly associated with the largest number of GO terms among all competing algorithms on all four PPI networks.
To further examine the statistical significance of the complexes detected by the competing algorithms, we compare the p-values of the complexes under GO terms of the biological process, molecular function, and cellular component domains. We use the lowest p-value for each predicted complex and show the comparison of the statistical significance of the complexes detected by all competing algorithms in Fig. 3. The y-axis of Fig. 3 represents the negative log-p-values, while the x-axis is the list of the complexes detected by each competing algorithm, ordered by their negative log-p-values. Since complexes with significant biological relevance have lower p-values, higher values in Fig. 3 represent higher quality of the detected complexes. As shown in Fig. 3, for all four  The superior performance of FLCD further demonstrates that a network structure with dense internal connectivity and sparse external connectivity can better depict complexes of biological significance, and that FLCD provides an effective way to predict complexes with the desired network structure by explicitly handling the internal and external connectivity of potential subnetworks.

Examples of predicted complexes
We further show the differences between the competing algorithms by illustrating the predicted complexes corresponding to two specific reference protein complexes. The first reference protein complex is the Smc5-Smc6 complex. In Fig. 4, the Smc5-Smc6 complexes predicted by FLCD, ClusterONE, LinkComm, and SR-MCL are displayed from (a.1) to (a.4), respectively. FLCD successfully identifies the Smc5-Smc6 complex, as shown in Fig. 4(a.1). ClusterONE fails to detect the protein annotated as NSE4, probably due to the inaccuracy of the greedy algorithm used in ClusterONE. Also, the protein annotated as GEX1 interacts only with the protein NSE3, but it is falsely added to the Smc5-Smc6 complex by ClusterONE. Because ClusterONE focuses on the separability of a complex but does not directly consider its internal density, it may mistakenly add proteins with small degrees to the final result. The complex in Fig. 4(a.3) predicted by LinkComm contains false positives and false negatives, since the similarities between interactions used in LinkComm cannot describe the topological structure of protein complexes. In Fig. 4(a.4), the Smc5-Smc6 complex predicted by SR-MCL contains many false positives. However, it is hard to explain the performance of SR-MCL on predicting the Smc5-Smc6 complex due to the unclear working mechanism of SR-MCL. Similarly, we show the RNase complexes predicted by all competing algorithms in Fig. 4 from (b.1) to (b.4). In (b.1), FLCD detects all proteins in the reference RNase complex but mistakenly includes the protein SKI7 due to the existence of false positive interactions between SKI7 and proteins in the RNase complex. In addition to SKI7, the complex predicted by ClusterONE (shown in Fig. 4(b.2)) contains two false positive proteins with very small degrees, because internal density is not considered.
Because LinkComm does not explicitly characterize the separability of the complexes, it also recruits some false positive proteins, as shown in Fig. 4(b.3). The complex obtained by SR-MCL has many false positive proteins, and the topological property of the predicted complex is not clear.

Conclusions
We propose an algorithm, FLCD, to predict protein complexes in protein-protein interaction networks. FLCD can better characterize the topological structure of a protein complex, which is densely connected inside and well separated from the rest of the network. We compare FLCD with three other state-of-the-art algorithms for protein complex prediction, and the comparison shows that FLCD achieves superior performance. Furthermore, GO enrichment analysis of the results of the competing algorithms demonstrates that FLCD finds more biologically meaningful complexes, within which proteins tend to be in the same cellular components and have similar functions and/or participate in the same biological processes.

Fig. 4 (partial caption) b.1 to b.4 are RNase complexes predicted by FLCD, ClusterONE, LinkComm, and SR-MCL, respectively. Nodes in red are proteins in the reference RNase complex and nodes in white are proteins outside the reference RNase complex.

Terminologies and definitions
Let an undirected graph $G = (V, E)$ represent a PPI network, where V denotes the set of proteins in G and E is the interaction set. A is the adjacency matrix of G with $A_{ij} = A_{ji}$, where $A_{ij} = 1$ denotes that node i interacts with node j and $A_{ij} = 0$ otherwise. The degree matrix D of G is a diagonal matrix with $D_{ii} = d_i$, where $d_i = \sum_j A_{ij}$ is the number of interactions connecting to protein i.
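For a set S of proteins, the conductance of S in G is defined as [17]
$$\phi(S) = \frac{|E(S, \bar{S})|}{\min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}}, \tag{5}$$
where $E(S, \bar{S})$ denotes the edge cut, i.e., the set of edges between the set S and its complement set $\bar{S}$, $|\cdot|$ denotes the set size, and $\mathrm{vol}(T) = \sum_{i \in T} d_i$ is the number of all incident interactions of the set T. Here we make a mild assumption that $\mathrm{vol}(S) \ll \mathrm{vol}(V)$ for a small protein complex S in the large-scale PPI network G, which means $\mathrm{vol}(S) = \min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}$. Hence, we have
$$\phi(S) = \frac{|E(S, \bar{S})|}{\mathrm{vol}(S)} = 1 - \frac{\sum_{i,j \in S} (A_S)_{ij}}{\sum_{i \in S} (D_S)_{ii}}, \tag{6}$$
where $A_S$ is the adjacency matrix of the induced subnetwork with respect to set S and $D_S$ is the degree matrix for the nodes in S, where $(D_S)_{ii} = \sum_j A_{ij} = d_i$ for $i \in S$. For the same set S, the density of S is defined as [18]
$$\mathrm{density}(S) = \frac{\sum_{i,j \in V} A_{ij} \mathbb{1}_{i \in S} \mathbb{1}_{j \in S}}{\sum_{i \in V} \mathbb{1}_{i \in S}}, \tag{7}$$
where $\mathbb{1}_{i \in S}$ is the indicator function depending on whether $i \in S$.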
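To make the two definitions concrete, the following sketch evaluates conductance and density on a small hand-made graph. It assumes the network is stored as an adjacency dictionary and that density counts each internal edge twice (the sum over ordered pairs divided by the set size); both the representation and the toy graph are illustrative assumptions rather than the paper's data.

```python
def conductance(adj, S):
    """phi(S) = |E(S, S_bar)| / min(vol(S), vol(S_bar))."""
    S = set(S)
    cut = sum(1 for i in S for j in adj[i] if j not in S)
    vol_S = sum(len(adj[i]) for i in S)
    vol_rest = sum(len(adj[i]) for i in adj) - vol_S
    return cut / min(vol_S, vol_rest)

def density(adj, S):
    """Sum of A_ij over ordered pairs i, j in S, divided by |S|."""
    S = set(S)
    internal = sum(1 for i in S for j in adj[i] if j in S)
    return internal / len(S)

# Toy graph (hypothetical): a 4-clique {0,1,2,3}, a pendant node 4 attached
# to node 0, and a triangle {5,6,7} joined to the clique by the edge 3-5.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 5],
    4: [0], 5: [3, 6, 7], 6: [5, 7], 7: [5, 6],
}

with_pendant = {0, 1, 2, 3, 4}
clique_only = {0, 1, 2, 3}
print(conductance(adj, with_pendant), density(adj, with_pendant))
print(conductance(adj, clique_only), density(adj, clique_only))
```

Including the pendant node lowers the conductance but also lowers the density, which is the tension between the two criteria that the rest of the method addresses.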

Motivation
FLCD is motivated by conductance minimization to identify well-separated subnetworks in a given network. However, FLCD overcomes the main weakness of conductance minimization, which pays no attention to the internal connectivity of the subnetworks considered as potential protein complexes. Figure 5 shows a motivating example: based on conductance minimization, we find the two complexes enclosed in the red dotted lines. The conductances of the complexes within the red dotted lines are 2/11 and 2/17, and the conductances of the complexes within the blue dashed lines are 3/10 and 3/16. The conductances of the complexes within the red dotted lines are lower than those of the complexes within the blue dashed lines, indicating that the former are topologically more separable than the latter. However, the complexes within the blue dashed lines are more likely to be the desired complexes, since the nodes with green border lines cannot be confidently grouped into potential protein complexes due to their low degrees.
Fig. 5 A motivating example for FLCD. Red dotted lines mark the complexes detected based on conductance minimization. Blue dashed lines mark the complexes predicted by our FLCD algorithm. Nodes with green border lines are removed by FLCD due to the lack of dense interactions
FLCD explicitly considers both the separability and the internal edge density of complexes, in two separate steps. In the first step, it addresses the separability of complexes by ensuring low conductance, in the hope that the resulting complexes have unique biological functions. In the second step, FLCD preserves the densely connected parts of the complexes identified in the first step. Because PPI networks are noisy and typically sparse, instead of finding cliques, we use the definition of internal density in (7) to search for dense subnetworks as the final predicted complexes.

Searching for a low-conductance set $H_v^*$
Given a starting protein v, our goal is to find a protein set $H_v^*$ with low conductance that includes v. We first apply the algorithm proposed in [17] to find a potential set $H_v$ with low conductance; the minimum-conductance set $H_v^*$ in $H_v$ is then identified by solving a mixed integer programming (MIP) problem exactly.
Following [17], a low-conductance set including v can be efficiently approximated via the personalized PageRank vector of v. The personalized PageRank vector $p(\alpha, v)$ of v on G is the stationary distribution of a random walk on G in which, at every step, the random walker restarts at v with probability $\alpha$ and otherwise performs a lazy random walk step. Mathematically, $p(\alpha, v)$ is the unique solution to
$$p(\alpha, v) = \alpha e_v + (1 - \alpha)\, p(\alpha, v)\, W, \tag{8}$$
where $\alpha \in (0, 1]$ is the "teleportation" constant, $e_v$ is the indicator vector of v, and $W = \frac{1}{2}(I + D^{-1}A)$ is the underlying probability transition matrix of the lazy random walk. We apply the local algorithm in [17] to efficiently compute an approximation $\tilde{p} \approx p(\alpha, v)$. Then we sort the nodes based on $\tilde{p}$ and obtain an ordered set $H = \{v_1, v_2, \ldots, v_n\}$, whose elements satisfy $\tilde{p}(v_i) > \tilde{p}(v_{i+1})$. Inspired by PageRank-Nibble [17], which sweeps the ordered set H to get a low-conductance set, we propose to find the minimum-conductance set within a subnetwork of size k, which consists of the top k elements in H, by solving a MIP problem. We take the top k elements of H, which are more likely to form a low-conductance set with v, and put them in $H_v$. The minimum-conductance set $H_v^*$ in $H_v$ can be derived by solving the following optimization problem based on (6):
$$\min_{x}\; 1 - \frac{x^T A_{H_v} x}{d_{H_v}^T x}, \qquad x_i \in \{0, 1\}, \tag{9}$$
where $A_{H_v}$ is the adjacency matrix of the subnetwork induced by $H_v$; x is a binary vector with $x_i = 1$ indicating that node i in $H_v$ is assigned to $H_v^*$ and $x_i = 0$ otherwise; and $d_{H_v}$ is a vector containing the degrees of every node in $H_v$. We force node v to be in the low-conductance set by setting $x_v = 1$. By algebraic manipulations, (9) can be transformed into an equivalent formulation (10) whose only nonlinear terms are products of the form $z x_i$ and $x_i x_j$, where z is a variable introduced by the transformation. After using standard techniques [24] to linearize $z x_i$ and $x_i x_j$, the optimization problem can be solved by any MIP solver, such as Gurobi [25]. Because $|H_v| = k$ is much smaller than $|V| = n$ and we only focus on identifying one low-conductance set, we can efficiently obtain the minimum-conductance set $H_v^*$ in $H_v$ by solving (10) exactly.
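To illustrate how a ratio objective of this kind leads to products of the form $z x_i$ and $x_i x_j$, the following is one possible equivalent reformulation. It is a sketch under stated assumptions, not necessarily the formulation (10) used in the paper: the epigraph variable $z$ and the auxiliary variables $y_{ij}$ and $u_i$ are introduced here for illustration only, and $A$ and $d$ denote the adjacency matrix and degree vector restricted to the top-k node set.

```latex
% Epigraph form: since x_v = 1 and d_v > 0, the denominator d^T x is positive,
% so z >= 1 - (x^T A x)/(d^T x) can be multiplied through by d^T x.
\begin{aligned}
\min_{x,\,z}\quad & z \\
\text{s.t.}\quad & \sum_{i} d_i x_i - \sum_{i,j} A_{ij}\, x_i x_j \;\le\; \sum_{i} d_i\, z\, x_i, \\
& x_v = 1,\qquad x_i \in \{0,1\},\qquad 0 \le z \le 1 .
\end{aligned}

% Standard linearization of the two kinds of products:
% y_{ij} stands for x_i x_j (both binary), u_i stands for z x_i (z in [0,1]).
\begin{aligned}
& y_{ij} \le x_i,\quad y_{ij} \le x_j,\quad y_{ij} \ge x_i + x_j - 1,\quad y_{ij} \ge 0, \\
& u_i \le x_i,\quad u_i \le z,\quad u_i \ge z - (1 - x_i),\quad u_i \ge 0 .
\end{aligned}
```

With these substitutions every constraint is linear in $(x, y, u, z)$, so the problem becomes a mixed integer linear program. The bounds $0 \le z \le 1$ are valid because a conductance value always lies in $[0, 1]$, and they are what makes the linearization of $z x_i$ exact for binary $x_i$.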
If node v is in a connected component of size $k'$ and we set $k > k'$, then we might obtain the trivial solution in which the low-conductance set is the connected component itself, with conductance 0. To avoid this, we apply the following procedure. We check every derived low-conductance set of size $k'$ to see whether it has exactly 0 conductance, which implies that it is a connected component of size $k'$. If that is the case, we set $k = k' - 1$ and re-solve the MIP to get a non-trivial solution.
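The search described in this subsection can be sketched end to end on a toy graph. The sketch below makes several substitutions that should be read as assumptions: the personalized PageRank vector is computed by plain power iteration instead of the local approximation algorithm of [17], the MIP is replaced by exhaustive enumeration (only practical for small k), conductance uses the cut-over-volume form that relies on the small-set assumption, and the graph is assumed to have no isolated nodes. All function names are hypothetical.

```python
from itertools import combinations

def personalized_pagerank(adj, v, alpha=0.15, iters=200):
    """Power iteration for p = alpha*e_v + (1-alpha)*p*W, W = (I + D^-1 A)/2."""
    p = {u: 0.0 for u in adj}
    p[v] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[v] += alpha                                # restart at the seed
        for u, mass in p.items():
            nxt[u] += (1 - alpha) * mass / 2           # lazy step: stay put
            for w in adj[u]:                           # or move to a neighbor
                nxt[w] += (1 - alpha) * mass / (2 * len(adj[u]))
        p = nxt
    return p

def cut_over_volume(adj, S):
    """Conductance under the small-set assumption: |E(S, S_bar)| / vol(S)."""
    cut = sum(1 for i in S for j in adj[i] if j not in S)
    return cut / sum(len(adj[i]) for i in S)

def min_conductance_set(adj, v, k, alpha=0.15):
    """Best subset of the top-k PageRank nodes that contains the seed v."""
    p = personalized_pagerank(adj, v, alpha)
    top_k = sorted(adj, key=lambda u: -p[u])[:k]
    others = [u for u in top_k if u != v]
    best, best_phi = None, float("inf")
    for size in range(len(others) + 1):                # brute force, not a MIP
        for extra in combinations(others, size):
            S = {v, *extra}
            phi = cut_over_volume(adj, S)
            if phi < best_phi:
                best, best_phi = S, phi
    if best_phi == 0 and len(best) > 1:
        # Trivial solution: a whole connected component of size k'.
        # Retry with k = k' - 1, as in the procedure described above.
        return min_conductance_set(adj, v, len(best) - 1, alpha)
    return best, best_phi

# Toy graph (hypothetical): 4-clique {0,1,2,3}, pendant node 4 on node 0,
# triangle {5,6,7} attached through the edge 3-5.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 5],
    4: [0], 5: [3, 6, 7], 6: [5, 7], 7: [5, 6],
}
best, phi = min_conductance_set(adj, v=1, k=5)
print(sorted(best), phi)
```

Enumeration visits $2^{k-1}$ subsets, so for k = 20 a MIP solver as used in the paper is the more sensible choice; the brute-force loop is only meant to make the objective and the zero-conductance retry explicit.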

Conservation of the densest subnetwork $C_v^*$
The induced subnetwork $G_v$ with respect to the protein set $H_v^*$ is well separated from the rest of the network; however, there may exist nodes with low degrees in $H_v^*$. As illustrated in Fig. 5, to remove low-degree nodes (nodes with green border lines) while preserving densely connected subnetworks, we apply the definition of the internal density (7) to find the densest subnetwork in $H_v^*$. Because the problem size is small for such a local optimization problem, we can again take full advantage of the power of MIP solvers. The node set $C_v^* \subseteq H_v^*$ corresponding to the densest subnetwork can be identified based on (7) by deriving the exact optimal solution to the following MIP problem:
$$\max_{r}\; \frac{r^T A_{H_v^*}\, r}{\mathbf{1}^T r}, \qquad r_i \in \{0, 1\}, \tag{11}$$
where $A_{H_v^*}$ is the adjacency matrix of the subnetwork induced by $H_v^*$, $\mathbf{1}$ is an all-one vector, and r is the binary vector indicating the memberships of the nodes from $H_v^*$ in the densest subnetwork. This optimization problem explicitly searches for the subnetwork with the highest internal density. As similarly done in (10), it can be transformed into an equivalent problem (12), which can also be cast into the MIP framework, with the exact optimal solution obtained by using standard MIP solvers after linearization [24].
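For small node sets, the densest-subnetwork step can be illustrated by exhaustive search. The sketch below is a stand-in for the MIP described above, not the paper's solver-based implementation. It assumes the density objective is the number of ordered internal pairs divided by the subset size, and it breaks ties in favor of the smallest subset found first; the function name and toy graph are hypothetical.

```python
from itertools import combinations

def densest_subset(adj, H):
    """Maximise (sum of A_ij over i, j in C) / |C| over non-empty C within H."""
    H = list(H)
    best, best_density = None, -1.0
    for size in range(1, len(H) + 1):
        for C in combinations(H, size):
            members = set(C)
            internal = sum(1 for i in members for j in adj[i] if j in members)
            d = internal / size
            if d > best_density:            # strict: keeps the first maximiser
                best, best_density = members, d
    return best, best_density

# Toy graph (hypothetical): 4-clique {0,1,2,3}, pendant node 4 on node 0,
# triangle {5,6,7} attached through the edge 3-5.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 5],
    4: [0], 5: [3, 6, 7], 6: [5, 7], 7: [5, 6],
}

# Suppose the first step returned the clique plus its pendant node.
core, core_density = densest_subset(adj, {0, 1, 2, 3, 4})
print(sorted(core), core_density)
```

On this input the pendant node is dropped because keeping it would lower the ratio from 3.0 to 2.8, mirroring the removal of the green-bordered nodes in the motivating example.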

The FLCD algorithm
The step-by-step procedure of the FLCD algorithm is given in Table 5. The FLCD algorithm screens every protein with degree higher than two. For each selected protein, the FLCD algorithm first searches for the minimum-conductance set around it and then finds the densest subnetwork in the minimum-conductance set, which is considered as a predicted complex. After screening every candidate protein, we remove duplicated complexes and complexes with size smaller than three. There is only one parameter k for the FLCD algorithm, where k can be considered as the upper bound of the sizes of the desired protein complexes. Also, the MIP problems (10) and (12) are both NP-hard. The actual computational complexity of solving these MIP problems depends on the size of these local problems, which is determined by k. The smaller k is, the less time it takes the FLCD algorithm to search for subnetworks as potential protein complexes. Throughout the experiments in this paper, we set k = 20.

Table 5 The FLCD algorithm
Algorithm: The FLCD Algorithm
Input: S = V and k = 20.
Output: A set of predicted complexes R.
4 Finding the lowest-conductance set $H_v^* \subseteq H_v$ based on (10).
5 Identifying the node set $C_v^*$ of the densest subnetwork in $H_v^*$ based on (12).
6 Considering $C_v^*$ as one predicted complex, let $R = R \cup \{C_v^*\}$ and $S = S \setminus \{v\}$.
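The screening loop and the post-processing described in this section can be outlined as follows. This is only a structural sketch: the two callables stand in for the low-conductance search and the densest-subnetwork search, and the stub implementations used in the example are placeholders chosen for illustration, not the MIP-based steps of FLCD. All names are hypothetical.

```python
def flcd_outline(adj, find_low_conductance_set, find_densest_subset,
                 min_degree=3, min_size=3):
    """Screen seeds, run the two steps, then drop duplicates and small sets."""
    seen, complexes = set(), []
    for v in adj:
        if len(adj[v]) < min_degree:        # keep seeds with degree > 2
            continue
        h_star = find_low_conductance_set(adj, v)
        c_star = frozenset(find_densest_subset(adj, h_star))
        if len(c_star) >= min_size and c_star not in seen:
            seen.add(c_star)                # remove duplicated complexes
            complexes.append(set(c_star))
    return complexes

# Placeholder steps (NOT the FLCD steps): closed neighborhood and identity.
def stub_low_conductance(adj, v):
    return {v, *adj[v]}

def stub_densest(adj, H):
    return H

# Toy graph (hypothetical): 4-clique {0,1,2,3}, pendant node 4 on node 0,
# triangle {5,6,7} attached through the edge 3-5.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 5],
    4: [0], 5: [3, 6, 7], 6: [5, 7], 7: [5, 6],
}
result = flcd_outline(adj, stub_low_conductance, stub_densest)
print(result)
```

Passing the two steps as callables keeps the outline independent of how they are solved, so an exhaustive search or a MIP solver can be plugged in without changing the loop.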