Identifying protein complexes in PPI network using non-cooperative sequential game

Maulik, Ujjwal; Basu, Srinka; Ray, Sumanta

doi:10.1038/s41598-017-08760-x

Download PDF

Article
Open access
Published: 21 August 2017

Identifying protein complexes in PPI network using non-cooperative sequential game

Ujjwal Maulik¹^na1,
Srinka Basu²^na1 &
Sumanta Ray³^na1

Scientific Reports volume 7, Article number: 8410 (2017) Cite this article

2042 Accesses
9 Citations
3 Altmetric
Metrics details

Subjects

Abstract

Identifying protein complexes from protein-protein interaction (PPI) network is an important and challenging task in computational biology as it helps in better understanding of cellular mechanisms in various organisms. In this paper we propose a noncooperative sequential game based model for protein complex detection from PPI network. The key hypothesis is that protein complex formation is driven by mechanism that eventually optimizes the number of interactions within the complex leading to dense subgraph. The hypothesis is drawn from the observed network property named small world. The proposed multi-player game model translates the hypothesis into the game strategies. The Nash equilibrium of the game corresponds to a network partition where each protein either belong to a complex or form a singleton cluster. We further propose an algorithm to find the Nash equilibrium of the sequential game. The exhaustive experiment on synthetic benchmark and real life yeast networks evaluates the structural as well as biological significance of the network partitions.

Detection of dynamic protein complexes through Markov Clustering based on Elephant Herd Optimization Approach

Article Open access 31 July 2019

Genome-wide inference of the Camponotus floridanus protein-protein interaction network using homologous mapping and interacting domain profile pairs

Article Open access 11 February 2020

Hierarchical graph learning for protein–protein interaction

Article Open access 25 February 2023

Introduction

Many cellular activities are carried out by proteins which physically interact with each other to form stoichiometrically stable complexes. These complexes interact with individual proteins or other complexes to form functional modules. The functional modules are observed to work in a coherent fashion even in the relatively simple model organisms like Saccharomyces cerevisiae¹. Study of the entire set of complexes from the data describing the physical interactions among proteins, is thus essential to understand the complex formations, and the higher level cellular organization².

New technological advancement in biotechnology have resulted in large-scale physical protein-protein interaction (PPI) data for different organisms which is used to reconstruct the PPI network. The PPI network can be described as an undirected graph where a node represents a protein and an edge represents the interaction between two proteins. Protein complexes constitute modular units within the PPI networks³. From a biological perspective, this modularity is a division resulted of evolution to provide robustness against mutation and chemical attacks⁴. From a topological perspective, this modularity is a result of proteins within complexes being densely connected to each other than to the rest of the network^{1, 5, 6}. Thus the computational approaches to protein complex identification from PPI data can be formulated as dense subgraph detection from PPI networks.

Since last decade several computational methods have been developed to infer the protein complexes where a protein complex is defined as a dense subgraph. A large class of existing literature for protein complex identification is based on network motif identification like clique finding approaches in refs 3, 7–9. The motif identification based methods fail to identify the protein complexes of irregular shapes. Another class of graph clustering techniques based on global criteria¹⁰ suffers from issues like resolution limit. Local search based approaches, employing merging or growing of clusters have proved to be more efficient. A popular local search based method, Molecular Complex Detection (MCODE) algorithm, proposed in ref. 11, starts by assigning certain weights to all the nodes based on their the local neighborhood density. The method then iteratively adds vertices having weights above certain threshold, starting with the maximum weighted node as a seed node. Restricted Neighborhood Search Clustering (RNSC) proposed in ref. 12, uses a cost-based local search algorithm to minimize the number of intra-cluster and inter-cluster edges. The method starts from an initial random solution, and iteratively moves on to form groups by assigning each vertices to a group which reduces the general cost. Recently, an overlapping neighborhood expansion mechanism based approach, ClusterONE, for detecting overlapping protein complexes is proposed in ref. 2. ClusterONE uses a greedy search procedure to find highly cohesive groups of vertices. The search process is performed repeatedly considering different seeds to form multiple, possibly overlapping groups. To obtain the resulting complexes, the overlapping groups are merged based on their overlap scores.

Other popular and recent developed methods for protein complex identification includes Affinity Propagation¹³, multi-objective based framework^{14, 15}, PPSampler2¹⁶ and PEWCC¹⁷. Affinity Propagation¹³ groups data points based on their similarity. All data points are initially considered as potential “exemplars”. The algorithm first finds sub-paths that allow easy message exchanges between nodes. In subsequent steps, message exchange between the nodes are continued until a high quality cluster is formed. The multi-objective based framework¹⁴ optimizes two objective functions, one is based on density of a module and another on the Gene Ontology based semantic similarity among proteins, resulting in dense and functionally homogeneous protein complexes. PPSampler2¹⁶ is a modification of its previous version PPSampler¹⁸ which partitions proteins into clusters based on a scoring function generated from the topological properties of the PPI network. A novel algorithm called PEWCC is proposed in Zaki et al., where reliability of the protein interaction data is assessed before partitioning the interaction network. PEWCC performs weighted clustering to partition the refined interaction networks into maximal cliques which serve as protein complexes.

A common limitation of the above mentioned approaches is that they ignore the self-organizing nature of the protein complexes, primarily triggered by the biochemical factors including hydrophilic energy, electrostatic forces between the proteins. Game theory, a natural choice of framework to model the self-organizing nature, could focus on the underlying factors that drive the protein complex formation. Bohl et al.¹⁹, reviews game theoretical concepts in cell biology and molecular biology focusing on the subcellular level by considering viruses, genes, and molecules as players. The existing works mostly uses two player simultaneous game for protein complex identification.

In this paper, we model the problem of protein complex detection as a multi-player sequential move non-cooperative game that models the proteins as rational players. The underlying hypothesis for the model are: (i) proteins always act in selfish manner: even when the proteins forms complexes it only optimizes its own objective(s) (ii) protein complexes are dense substructures of low diameters: most of the real life networks tend to demonstrate small world property, and (iii) there might exist some proteins which do not belong to any complex: this could be due to the physio-chemical properties. We formulate a Partial Dense vertex Cover (PDC) game for protein complex identification. The strategy set of a player is determined by a preference relation which is a transitive, reflexive function over the set of all possible vertex covers. The Nash equilibrium of the game is the minimum partial dense vertex cover. We propose an algorithm, NashPDC, to find the Nash equilibrium of the PDC game. The resulting partition generated by NashPDC represents the protein complexes.

A thorough experiment is carried out by comparing the performance of the proposed method with that of the other well-known methods such as clusterONE², MCODE¹¹, RNSC¹², MCL²⁰, PPSampler2¹⁶ and PEWCC¹⁷. The biological relevance of the identified complexes are assessed by gene ontology and pathway based analysis. The experiment uses synthetic benchmark network and real life yeast PPI network for evaluation. The experimental results show a significant improvement achieved by the proposed approach over the other methods.

Materials and Methods

We represent a PPI network using an undirected graph G = (V, E) where a node v ∈ V represents a protein and an edge e ∈ E represents the interaction between two proteins. In this section, we discuss the proposed method which is based on the following basic concepts of graph theory and non-cooperative game theory.

Given an unweighted graph G = (V, E), G(S) denote the subgraph induced by S on G where S ⊆ V. The set of edges of G(S) is denoted by E(G(S)). For any vertex v ∈ V, the p ^th order open neighborhood, N _p(v), is the set of vertices connected to v by a path of length less than or equal to p. The closed p ^th order neighborhood of vertex v is N _p[v] = N _p(v) ∪ {v}. The degree of vertex v ∈ V, d _v = |N ₁(v)|. The degree density, α(G), is defined as the ratio of the minimum degree of G to the maximum possible degree, i.e. $\alpha (G)=\frac{mi{n}_{v\in V}{d}_{v}}{|V|-1}$. The edge density, δ(G), is defined as the ratio of the number of edges in G to the total number of possible edges, i.e. $\delta (G)=\frac{|E(G)|}{(\begin{array}{c}|V|\\ 2\end{array})}$. The radius, R(G), also termed as the minimum eccentricity of G is defined as the maximum shortest path between any two pairs of nodes, i.e. R(G) = max _u,v∈V|SP(u, v)| where SP(u, v) denotes the shortest path between u, v. The local transitivity of a vertex v, denoted by t(v) is defined as the ratio of the triangles connected to the vertex and the triples centered on the vertex, i.e. $t(v)=\frac{|\{(u,v),(v,w),(u,w)|(u,v),(v,w),(u,w)\in E(G)\}}{|\{(u,v),(v,w)|(u,v),(v,w)\in E(G)\}|}$. Further, the term cover indicate a set of vertices and a graph partition is defined as a set {C ₁, C ₂, …, C _K} (K is a positive integer and K ≤ |V|) such that – (i) ∀ k ∈ {1, …, K}, C _k ≠ ∅, (ii) ${\cup }_{k=1}^{K}{C}_{k}=V$ and (iii) ∀ k, l ∈ {1, …, K} with k ≠ l, C _k ∩ C _l = ∅.

In non-cooperative game theory, strategic form game models the interaction between a finite set of N rational players. In strategic form game a player’s decision problem is to choose a strategy that will counter best the strategies adopted by the other players. Each player is faced with this problem and the players can be thought of as simultaneously choosing their strategies from the respective strategy sets. A strategic form game is modeled by a three tuple $(N,({S}_{i}{)}_{i\in N},({\succcurlyeq }_{i}{)}_{i\in N})$ where -

N is a finite set of rational players
S _i denote the strategies or actions of player i while S = x_i∈N S _i is called the set of action profiles (or strategy profiles)
${\succcurlyeq }_{i}$ denote the preference relation which is a reflexive $(a\,\succcurlyeq \,a)$, transitive $(a\,\succcurlyeq \,b\,{\rm{and}}\,b\,\succcurlyeq \,c,\,{\rm{implies}}\,a\,\succcurlyeq \,c)$, total (for all elements a, b either $a\,\succcurlyeq \,b$ or $b\,\succcurlyeq \,a$) binary relation on the set of action profiles. We write $a\,\succ \,b$ if $a\,\succcurlyeq \,b$ but not $b\,\succcurlyeq \,a$. Intuitively, $a\,\succ \,b$ means that strategy b is preferable to a. The preference relation may also be defined based on the outcomes of the strategies.

It is assumed that in the strategic form game the set of strategies and the players’ preference relations are known to all the players. The only uncertainty concerns the actions chosen by the players. For i ∈ N let s _−i ∈ S _−i where S _−i denote the action profile of all the players in N\i. The best responses of player i given the actions of other players, s _−i, is defined as

$${B}_{i}({s}_{-i})=\{{s}_{i}\in {S}_{i}|({s}_{-i},{s}_{i})\,{\succcurlyeq }_{i}\,({s}_{-i},{s}_{i}^{^{\prime} }),\forall {s}_{i}^{^{\prime} }\in {S}_{i}\}$$

An action profile ${s}^{\ast }=({s}_{1}^{\ast },{s}_{2}^{\ast },\ldots ,{s}_{n}^{\ast })$ such that for each $i\in N,{s}_{i}^{\ast }\in {B}_{i}({s}_{-i})$ is in Nash equilibrium. In other words, in Nash equilibrium no player i has a profitable deviation from s ^*.

The framework

The proposed framework is based on the concept of partial dense vertex cover, defined as below.

Definition 1. Given an unweighted graph G = (V, E), a partial dense vertex cover C(G) is a collection of subsets of V, {C ₁, … C _k} with the properties:

1.
∀i ∈ {1, … k}, ∅ ≠ C_i ⊆ V
2.
∀i, j ∈ {1, …, k} with i ≠ j, C_i ∩ C_j = ∅
3.
C ₁ ∪ … ∪ C _k ⊆ V
4.
∀i ∈ {1, … k}, R(G(C _i)) ≤ p
5.
∀i ∈ {1, … k}, α(G(C _i)) ≥ λ and δ(G(C _i)) ≥ γ
6.
∀i ∈ {1, … k}, C _i is a locally maximal cover.
7.
the residual graph $G(V\backslash {\cup }_{i=1}^{k}{C}_{i})$ does not contain any induced subgraph satisfying the other conditions.

Here 0 < λ ≤ γ ≤ 1 and p is a positive integer.

A minimum partial dense vertex cover is a partial dense vertex cover with the minimum value of k. In this paper, the terms cover and coalition have been used interchangeably.

The proposed method models the protein-complex detection method as a partial dense vertex cover (PDC) strategic form game, $PDC=(N,{({S}_{i})}_{i\in N},{({\succ }_{i})}_{i\in N})$ as discussed below.

Each player i from the set of rational players N represents a node i ∈ V in G.
The strategy set S _i of a player i ∈ N is to – (i) propose a coalition, by sending out joining requests (ii) accept a joining request (iii) reject a joining request and (iv) leave a coalition. Once a player accepts and joins a coalition he is not allowed to propose a new coalition. A player if rejects a coalition joining request or leaves a coalition can subsequently propose a new coalition. A proposer, if do not receive satisfactory response may choose to leave the coalition.
${\succ }_{i}$ is the preference relation of a player i over the set of possible partitions. The preference relation determines the quality of the partial dense vertex cover. A player prefers to belong to a minimum partial dense vertex cover or to stay alone than to belong to any random partition.

Given the preference relation ${\succ }_{i}$, the Nash equilibrium of the PDC is a partition where no player can gain from unilateral deviation. For a PDC game, a partition p ^* is in Nash equilibrium, if for all other partitions p and for every player i, ${p}^{\ast }{\succcurlyeq }_{i}p$ when all the players in N\i plays their best response strategy. In other words, a minimum partial dense vertex cover is a Nash equilibrium of PDC game and vice versa. There may exist multiple Nash equilibriums. We eliminate the two trivial partitions i) where a single cover or a grand coalition forms and ii) where all the players form singleton coalitions.

The following example illustrates the above framework.

Example 1. Figure 1 shows a graph G with 12 nodes. We induce the PDC game on the graph G so that every node of G is mapped to a player in PDC. For λ = 0.6, γ = 0.65 and p = 2 the preference relation of the players are given as below.

$$\begin{array}{c}\{1,2,3,4,5,6\}{\succ }_{3}\{1,2,3,4,5\}{\succapprox}_{3}\ldots \\ \{1,2,3,4,5,6\}{\succ }_{4}\{1,2,3,4,5\}{\succ }_{4}\{1,2,3,4\}{\succ }_{4}\ldots \\ \{1,2,3,4,5,6\}{\succ }_{5}\{1,2,3,4,5\}{\succ }_{5}\{1,2,3,5\}{\succ }_{5}\ldots \end{array}$$

The partition {{1, 2, 3, 4, 5, 6}, {7, 8, 9, 10, 11, 12}} is in Nash equilibrium, as no player can gain from unilateral deviation.

We propose an algorithm, Nash equilibrium based Partial Dense vertex Cover Detection (NashPDC) to find the Nash equilibrium of the game PDC. The basic working principal of the Algorithm 1 is described below.

The players are ranked based on their geometric mean of degree and local transitivity (in descending order). This forms the rule of order. The first player in the rule of order is the one with the maximum links and maximum local transitivity. This is one of the ways to prioritize proteins. However, other approaches may be used.
The first player i of rule of order proposes a coalition C which is the p ^th order bounded neighborhood of i.
Each player j in C accepts the coalition joining request only if that is his best response strategy, otherwise rejects the invitation. Once all the players either accept or reject the coalition joining request, a stable coalition C forms. Subsequently, all the players in C quit the game.
The game play is continued by repeating the above steps with the players who do not belong to any coalition until no more stable coalition can be formed.

Complexity

At the worst case, every node sends a coalition request. Selecting a player from the rule of order takes O(logN) time. For each proposed coalition at most ${d}_{max}^{p}$ number of nodes are processed where d _max is the maximum degree in G. Thus the worst case complexity of the algorithm 1 is $O(N(\mathrm{log}\,N+{d}_{max}^{p}))$. With p = 2 the complexity reduces to $O(N(\mathrm{log}\,N+{d}_{max}^{2}))$.

Results

In this section, we evaluate the performance of our proposed method with that of some state-of-the art techniques. The evaluation is conducted on both synthetic benchmark and real life yeast PPI network data sets.

Dataset description

To test the correctness of the proposed method we first use the synthetic benchmark networks generated using the popular Girvan Newman (GN) model²¹. GN benchmark is a special case of the planted 4-partition model where a graph with 128 vertices are partitioned into 4 groups (modules) with 32 vertices each. The inter-module edges of GN model is controlled by the mixing parameter μ. In other words, the mixing parameter μ controls how well the modules are defined by controlling the number of edges with which a module is connected to the rest of the graph. When the value of μ is strictly less than 0.5, the expected number of links joining a node to others in different groups is less than those in the same group. This yields well defined groups (modules). As the value of μ drops to 0.5 or below, it becomes difficult to identify a module boundary from the rest of the network. Irrespective of the size and sparsity of the networks, GN benchmark dataset is a popular choice to test the correctness of different module detection algorithms and compare their performances. By varying the value of μ we generate networks with well defined modules as well as the networks with modules that are not so clearly distinguishable from the rest of the network. The later is a case similar to real life scenarios where all modules are not necessarily well defined. The code to generate the GN networks are freely available at the site https://sites.google.com/site/santofortunato/inthepress2.

Further we use real life yeast PPI networks - two experimental yeast PPI datasets (Gavin et al.²², Krogan et al.²³), yeast PPI interaction derived from DIP²⁴ and a manually curated high-quality yeast dataset from MIPS²⁵. The Krogan dataset has two variants, namely, the core data set which is referred to as Krogan core and the extended data set which is referred to as Krogan extended. The key topological properties of the PPI networks built from these dataset are given in Table 1. For comparing the resulting protein complexes with benchmark we downloaded gold standard data from the site http://yeast-complexes.russelllab.org/. It consists of 491 experimentally verified yeast protein complexes. We consider this as benchmark and compute the extent of overlap with the resulting clusters.

Table 1 The topological parameters of the PPI datasets. N is the number of nodes, E is the total edge count, δ the network density defined as $E/(\begin{array}{c}N\\ 2\end{array})$, d _avg is average degree and CC is the clustering coefficient. Krog_Cr and Krog_Ex are the Krogan Core and Krogan Extended data sets respectively.

Full size table

Validation on synthetic benchmark

We first test the performance of the proposed method on synthetic benchmark networks generated using GN model. The inter-module edges of GN model is controlled by the mixing parameter μ. We consider μ = {0.1, 0.2, …, 0.7}. To simulate the real life modules which are not very clearly distinguishable from the rest of the network, we vary the value of μ. To avoid any bias in the results obtained by conducting the experiments on a single GN synthetic network, we generate 100 instances of every configuration of GN model. The results obtained for each network configuration are then averaged and reported.

To quantitatively measure the correspondence between the ground-truth modules and the identified modules we use the metric Normalized Mutual Information (NMI). NMI, defined as below, measures the similarity between two partitions based on entropy^{26, 27}. For two given partitions π ^a,π ^b, NMI is:

$$\frac{-2{\sum }_{i=1}^{{k}_{a}}{\sum }_{j=1}^{{k}_{b}}{n}_{ij}^{ab}(\mathrm{log}\,\frac{{n}_{ij}^{ab}.n}{{n}_{i}^{a}.{n}_{j}^{b}})}{{\sum }_{i=1}^{{k}_{a}}{n}_{i}^{a}(\mathrm{log}\,\frac{{n}_{i}^{a}}{n})+{\sum }_{j=1}^{{k}_{b}}{n}_{j}^{b}(\mathrm{log}\,\frac{{n}_{j}^{b}}{n})}$$

where ${n}_{i}^{a}$ is the number of nodes in cluster ${C}_{i}^{a}\in {\pi }^{a}$ and ${n}_{ij}^{ab}$ is the number of common nodes between cluster ${C}_{i}^{a}\in {\pi }^{a}$ and ${C}_{i}^{b}\in {\pi }^{b}$. The value of NMI ranges between 0 to 1. Higher the value, the more similar the two partitions are with the maximum value 1 indicating two identical partitions. Figure 2 shows the NMI values for different values of μ. As shown in the figure, the proposed method correctly identifies all the modules for the values of μ upto 0.4. For higher values of μ as the inter module edge count becomes equal or more than the intra module edge count, identifying a module, thus, becomes difficult. As a result the NMI falls sharply when μ ≥ 0.5. The experiment shows the accuracy of the proposed method in identifying well defined modules.

Validation on null model

To verify the performance of the algorithm on the null model, we test our performance on random Erdös-Réyni (E-R) graphs²⁸. The nodes being randomly connectedly, the E-R network do not exhibit any modular structure. A good dense cluster detection algorithm should not identify any significant structure from E-R network. We run the proposed method on the E-R graphs. The proposed method do not identify any significant module. The experiment shows that the proposed method works correctly on null model.

Validation on noisy synthetic data

Presence of noise is a very common phenomenon when dealing with complex real life data sets. Noisy data occur due to many reasons including but not limited to erroneous measurements, sampling bias. The study of stability of a method against noise is thus essential. We study the stability of the proposed method when the endpoints of the edges of a graph is rewired to a random vertex with given rewiring probability (ρ). The initial graphs are generated using GN model. We measure the performance by computing the NMI between the known partition of the initial graph and the partition obtained from the perturbed graph for various values of ρ.

Figure 3 shows the NMI values for different values of ρ. As shown in the figure, the proposed method correctly identifies all the modules for the values of ρ upto 0.05. When ρ is above 0.25 the value of NMI drops sharply. The experiment shows that the proposed method correctly identifies the modules when a certain level of random noise present in the data and is thus robust against random noise.

Validation using connectivity density

To validate the identified modules from topological perspective, we use the connectivity density measure²⁹. The connectivity density of a module M is defined as follows:

$$\frac{{\sum }_{{n}_{i}\in M}d({n}_{i})}{N},$$

where d(n _i) represents degree of node n _i within the module M, N represents total number of connections. It simply denotes the ratio of total degrees of nodes within the module to the total number of connections. The experiment studies the relative change in the connectivity density when an identified protein complex is shifted a little. A shifted (replaced) module is obtained by randomly replacing a small portion of the proteins in a module with the same number of proteins outside of that module in such a way that the replacement proteins are connected with the proteins in the module but do not belong to it. For an identified protein module, a little shift is expected to decrease the connectivity density, which should not be expected from a random module. In addition, shift in random modules may results either increase or decrease in connectivity density. The experiment is carried out on the real life yeast PPI networks.

Figure 4(a,b) shows the results averaged over 300 randomization experiments. Figure 4(a) is a scatter plot of the densities of identified modules and replaced modules. As shown in Fig. 4(a) for most of the modules, 20% component replacement causes the density to decrease by a significant amount dropping below the original density. With the higher rate of replacements (30% and 40%), the density decrease is even higher. Figure 4(b) shows the summary statistics of the changes in the replaced module density. As shown in the figure, on an average the replaced module density is nearly 50% of the original module density for 20% component replacement and even lower for higher replacements. The observation suggests that the identified modules are indeed densely connected local subgraphs, and thus are good candidates for functional modules in the yeast protein network.

To know whether the obtained results is trivial and can be easily expected for any random modules we have performed an analysis. For this, we have repeated the same experiment for randomly selected modules from the PPI network. Figure 4(c,d) shows the scatter plots of the densities of random as well as replaced modules. It is observed that there is no distinct pattern which distinguishes the densities of modules with 20, 30 and 40% replacement.

Evaluating structural quality using ground truth

In this experiment the performance of the proposed method is compared with the base line methods by using metrics that measure the correspondence between the predicted complexes and the reference complexes. The metrics used for evaluation are: (i) Jaccard index based performance metric, matching score(D), (ii) sensitivity(Sn), positive predictive value(PPV), and accuracy(Acc)^{2, 30}.

Let P = {P ₁, P ₂, … P _n} and R = {R ₁, R ₂, … R _m} are the sets of predicted complexes and reference complexes, respectively. The matching score is the geometric mean of S and T, defined as:

$$D=\sqrt{S\times T}.$$

(1)

where S is the average predicted complex-wise overlap defined as $S=\frac{1}{n}{\sum }_{i=1}^{n}O{v}_{i}$, T is the average reference complex-wise overlap defined as $T=\frac{1}{m}{\sum }_{j=1}^{m}O{v}_{j}$, Ov _i is the predicted complex-wise overlap defined as $O{v}_{i}=ma{x}_{j=1}^{m}{o}_{ij}$, Ov _j is the reference complex-wise overlap defined as $O{v}_{j}=ma{x}_{i=1}^{n}{o}_{ij}$, o _ij is the overlap score defined as ${o}_{ij}=\frac{|i\cap j|}{|i\cup j|}$, i ∈ P and j ∈ R. Higher the value of D indicates better involvement of predicted complexes to the reference complexes and vice-versa.

The metrics Sn, PPV, Acc are defined on a contingency table T, where every element t _ji indicates the number of common proteins between reference complex j and predicted complex i. The General Sensitivity (Sn) of a clustering result is defined as:

$$Sn=\frac{{\sum }_{j=1}^{m}{N}_{j}S{n}_{j}}{{\sum }_{j=1}^{m}{N}_{j}}$$

(2)

where N _j is the number of proteins belonging to complex j and Sn _j is the reference complex-wise sensitivity for reference complex j defined as $S{n}_{j}={\max }_{i=1}^{n}\frac{{t}_{ji}}{{N}_{j}}$.

The General PPV(PPV) of a clustering result is defined as:

$$PPV=\frac{{{\sum }_{i=1}^{n}{T}_{.i}PPV}_{i}}{{\sum }_{i=1}^{n}{T}_{.i}}$$

(3)

where ${T}_{\mathrm{.}i}={\sum }_{j=1}^{m}{t}_{ji}$ is the marginal sum of a column i in the contingency table T and PPV _i is the predicted complex-wise positive predictive value for predicted complex i defined as $PP{V}_{i}={\max }_{j=1}^{m}\frac{{t}_{ji}}{{T}_{.i}}$.

Since Sn is maximum when every protein is assigned to the same cluster, while the PPV is maximum when every protein is in its own cluster, it is necessary to balance the two measures. The Geometric Accuracy (Acc) represents a trade-off between sensitivity and the positive predictive value and is defined as:

$$Acc=\sqrt{{S}_{n}\times PPV}\mathrm{.}$$

(4)

The advantage of taking the geometric mean is that it yields a low score when either the Sn or the PPV metric is low. High accuracy value thus indicates a high performance in terms of both Sn and PPV.

The results of evaluation are shown in Tables 2 and 3. In Table 2 the performance of the proposed method is compared with other five baseline methods, viz., MCODE, MCL, ClusterONE, RNSC, PPSampler2 and PEWCC in terms of the matching score D. As shown in the table, unlike other methods, the proposed method performs consistently well on all the data sets. The proposed method outperforms MCODE and MCL in most of the data sets. ClusterONE method attains higher D value for DIP, GAVIN and Krogan_Cr data sets while the same attains lower D value for other data sets. On the contrary, RNSC method attains higher D value for MIPS and Krogan_Ex data sets while the same attains lower D value for other data sets.

Table 2 Comparisons of performance of different algorithms using matching score.

Full size table

Table 3 Comparisons of results with respect to sensitivity, specificity and accuracy.

Full size table

Table 3 shows the performance of the methods in terms of the metrics Sn, PPV, Acc. It can be noticed from the table that the proposed method performs consistently well in each of the datasets. In terms of Sn the proposed method outperforms MCODE and clusterONE in all the data sets. However, RNSC and MCL exhibit higher accuracy (Acc) than the proposed method on MIPS and KROGAN-extended data sets. It is also important to note that only the proposed method consistently gives good Sn value on all the data sets. As shown in the result no other method attain consistently high Sn on all the data sets.

We further analyze the the complexes identified by the proposed method on Gavin dataset, and compare the same with that identified by clusterONE method as clusterONE attains second highest accuracy score after the proposed method. Among the experimentally verified benchmark complexes in Gavin data set, 10.62% complexes (52 complexes out of 490 complexes) are captured with more than 80% coverage by the proposed method. On the contrary clusterONE covers 8.16% benchmark complexes (40 out of 490) with more than 80% coverage. Figure 5 shows the layout of the three benchmark complexes: ‘Small subunit processom’, ‘complex-435’ and ‘complex-410’. Green and red nodes represent captured and non-captured nodes, respectively, by a particular method. From Fig. 5(a) it can be noticed that, for ‘Small subunit processom’ complex, the proposed method covers 89.66% proteins, while for clusterONE the percentage of coverage is 68.97%. Similarly, for the other two complexes, the coverage attained by the proposed method is much higher than that of clusterONE.

To further know the performance of the proposed method in large PPI data, we have utilized WI-PHI PPI database of yeast³¹, which consists 6,000 proteins and 50,000 PPIs. We run the proposed method on this data and compare the predicted clusters with the experimentally verified complexes. The results consists of 17 predicted clusters with minimum and maximum size eight and 39, respectively. The resulting sensitivity (S _n = 0.2384), positive predictive value (PPV = 0.4619) and accuracy (Acc = 0.3318) reveals that proposed method performed well in large PPI data.

To investigate whether the proposed method can detect sparse, low dense protein complexes we set the input parameter α (degree density) and δ(edge density) to lower values (α = 0.001, delta = 0.0008) and run the algorithm. We collected the resulting low dense clusters and compare these with benchmark protein complexes having lower density. Figure 6 shows two identified clusters which match with five benchmark low dense complexes such as: m-AAA protease complex, complex-329, Scs2/Opi1 complex, complex-319 and Cytochrome bc1 complex. It can also be noticed from the Fig. 6 that the unmatched proteins of the identified clusters exhibits high interactions among them and need to be explored further.

Evaluating biological significance

Functional similarity of the identified complexes

It has been observed that proteins within a complex are functionally similar^{14, 32, 33}. For understanding the biological roles and functions of proteins, functional similarity is a more informative measure compared to the structural and sequence similarity. The semantic similarity between Gene ontology (GO) terms is used to measure the functional similarity between proteins³⁴. Here we use Relevance measure proposed in ref. 35 to compute functional similarity between the identified modules.

Figure 7 shows the distribution of functional similarity scores for the identified modules. In Fig. 7(a) the left pane shows the fraction of identified modules having similarity score above certain value x, while the right pane shows the distribution of similarity score. As shown in the figure for Gavin data set more than 75% modules identified by the proposed method have semantic similarity score higher than 0.5 while for the MCODE, MCL, clusterONE and RNSC it is 53.66%, 39.02%, 61.34% and 48.78%. Similarly, as shown in Fig. 7(b) a large fraction of modules identified by the proposed method on the Krogan_Extended data set has similarity score above 0.5.

Gene ontology based analysis of the identified complexes

We have performed a GO and pathway based study to biologically validate the identified protein complexes. GO represents an important resource to describe the functional characteristics of genes in a module. In Table 4 we provide the most significant GO terms, GO-id of biological process (BP) annotation of top 15 identified protein complexes in Gavin dataset. In Table 4 last column provides the p-value of each annotated GO-term. The p-value is computed by comparing the GO terms shared by the genes in the module to the background distribution of annotation. We have utilized a widely used webserver David functional annotation tool (https://david.ncifcrf.gov/) to perform the gene enrichment analysis. Here, the obtained p-values are calculated by using Fisher’s exact test. The obtained p-values are also subsequently adjusted for multiple comparisons using a Bonferroni correction based on the number of genes in the modules. The p-value of a gene module signifies the probability of observing at least x number of genes out of total n genes in the module annotated to a particular GO terms, given that the proportion of genes in the whole genome are annotated with that GO terms. So a p-value of a module closer to zero signifies that it is less likely to observe the annotation of a particular GO term to a group of genes occurs by chance. For comparison purpose we have computed the p-values of all the protein complexes identified by MCODE, MCL, clusterONE, RNSC and PPSampler2 algorithm. The distribution of p-values is given in Fig. 8. The left pane of the Fig. 8 shows the fraction of modules having p-value below certain value x while the right pane shows the distribution of p-values. As shown in the figure for any given p-value, the fraction of predicted complexes identified by the proposed technique is larger than the other methods.

Table 4 GO-terms, GO-id and p-value of top 15 identified complexes.

Full size table

Evaluation using aggregated ranking

The above experiments measure the performances based on various metrics and the results show that the different protein complex identification methods rank differently for different metrics. Thus it is difficult to comment on the best performing algorithm. In this section, we evaluate the performances of the methods using an aggregated ranking procedure. We use three topological measures namely density, betweenness centrality and clustering coefficient and a biological relevance measure namely GO based semantic similarity for ranking each complex. Subsequently, a rank aggregation scheme³⁶ is used to aggregate the individual ranks. A comparison between the predicted complexes identified by the proposed method and that by the clusterONE is reported in Table 5. The Table 5 shows the modules having top five ranks, their matched benchmark complexes and proportion of proteins involved in some benchmark complexes. As can be seen from the table that predicted complexes of the proposed method have higher proportion of involvement in the benchmark than clusterONE. From Table 5 it can be noticed that the complexes: COMPASS, histone H3 methyltransferase protein complex, Protein phosphatase 1 complex and Polyadenylation Factor I are detected by the top rank predicted complex-2. Figure 9 further shows the predicted complex and the proteins involved in three different benchmark complexes.

Table 5 Table shows the comparison of top rank five predicted complexes of proposed and clusterONE method. Second column represents benchmark complexes detected by the methods.

Full size table

Discussion

Identifying protein complexes from protein-protein interaction (PPI) data is an important problem in computational biology. The existing literature resulted from a cross-disciplinary research, proposes several methods based on global objective maximization, network motif identification, local search mechanisms etc. In this paper we propose a sequential non-cooperative game based model for protein complex detection from PPI network. We model protein complex identification as a non-cooperative sequential partial dense vertex cover game. The Nash equilibrium of the game corresponds to a minimum partial dense vertex cover of the given network.We carry out a thorough experiment for validation of topological and biological relevance of the identified complexes. The data sets used in the experiment are synthetic benchmarks and real life yeast PPI data. The experiments on synthetic benchmark network and null-models proves the accuracy of the proposed method. We further test robustness of the proposed model on perturbed synthetic benchmarks where edges are rewired with certain probability. The results prove that the developed model accurately identifies the network module even when the rewiring probability is moderately high. The tests on real PPI networks also shows promising results. The validation through connectivity density proves that the identified modules on the real data are dense indeed and good candidates for protein complexes. The partitions are further validated using matching score, sensitivity, positive predictive value and accuracy metrics against experimentally annotated data. Although the proposed method does not outperformed all other state-of-the-art with respect to the performance metrics, but it achieves a consistently good score for all of the datasets. It is noticeable that, although RNSC and MCL achieve higher accuracy than the proposed method in MIPS and KROGAN data, they are not performing well while comparing with respect to the functional similarity score. The experiment shows that the proposed method attains high functional similarity score for most of the identified modules. Additionally, from the experiment we find a common drawback with the existing approaches is that they find a few well-matched clusters and a high number of poorly matched clusters. The proposed approach overcome this issue and finds the clusters whose matching score is far less skewed compared to the existing approaches. Finally we measure the performance using an aggregated ranking approach which proves the superiority of the proposed method over the base-line methods. It is noticeable that the proposed approach is also capable to identify sparse, low dense protein complexes. Setting the input parameters α and β to a low value, the proposed method results low dense clusters which may be treated as possible candidates for sparse, low dense protein complexes.

Applying game theory for sub-cellular events like protein-protein interactions is a rather new approach of research. For simplicity, the proposed method is designed to detect the disjoint complexes. However, with the use of appropriate data structure and by removing the restriction that if a protein joins a coalition, then the protein is removed from the game, the proposed method can be extended to detect the overlapping complexes. In future we would like to study the weighted PPI networks for the analysis of protein complexes and functional modules.

References

Srihari, S. & Leong, H. W. A survey of computational methods for protein complex prediction from protein interaction networks. Journal of bioinformatics and computational biology 11, 1230002 (2013).
Article PubMed Google Scholar
Nepusz, T., Yu, H. & Paccanaro, A. Detecting overlapping protein complexes from protein-protein interaction networks. Nature Methods 9, 471–472 (2012).
Article CAS PubMed PubMed Central Google Scholar
Mirny, L. & Spirin, V. Protein complexes and functional modules in molecular networks. Proc. Natl Acad. Sci 100(21), 12123–12128 (2003).
Article ADS PubMed PubMed Central Google Scholar
Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray, A. W. From molecular to modular cell biology. Nature 402, C47–C51 (1999).
Article CAS PubMed Google Scholar
Zhang, B., Park, B.-H., Karpinets, T. & Samatova, N. F. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 24, 979–986 (2008).
Article CAS PubMed Google Scholar
Srihari, S., Yong, C. H., Patil, A. & Wong, L. Methods for protein complex prediction and their contributions towards understanding the organisation, function and dynamics of complexes. FEBS letters 589, 2590–2602 (2015).
Article CAS PubMed Google Scholar
Pereira-Leal, J., Enright, A. & Ouzounis, C. Detection of functional modules from protein interaction networks. Proteins 54, 49–57 (2004).
Article CAS PubMed Google Scholar
Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K. & Kanaya, S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7, 207 (2006).
Article PubMed PubMed Central Google Scholar
Brohee, S. & van Helden, J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 471–488 (2006).
Article Google Scholar
Girvan, M. & Newman, M. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826 (2002).
Article ADS MathSciNet CAS MATH Google Scholar
Bader, G. & Hogue, C. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1471–2105 (2003).
Article Google Scholar
King, A. D., Przulj, N. & Jurisica, I. Protein complex prediction via cost-based clustering. Bioinformatics 20, 3013–3020 (2004).
Article CAS PubMed Google Scholar
Frey, B. J. & Dueck, D. Clustering by passing messages between data points. Science 315, 972–976 (2007).
Article ADS MathSciNet CAS PubMed MATH Google Scholar
Mukhopadhyay, A., Ray, S. & De, M. Detecting protein complexes in a ppi network: a gene ontology based multi-objective evolutionary approach. Mol Biosyst. 8, 3036–48 (2012).
Article CAS PubMed Google Scholar
Bandyopadhyay, S., Ray, S., Mukhopadhyay, A. & Maulik, U. A multiobjective approach for identifying protein complexes and studying their association in multiple disorders. Algorithms for Molecular Biology 10, doi:10.1186/s13015–015–0056–2 (2015).
Widita, C. K. & Maruyama, O. Ppsampler2: Predicting protein complexes more accurately and efficiently by sampling. BMC systems biology 7, S14 (2013).
Article PubMed PubMed Central Google Scholar
Zaki, N., Efimov, D. & Berengueres, J. Protein complex detection using interaction reliability assessment and weighted clustering coefficient. BMC bioinformatics 14, doi:10.1186/1471–2105–14–163 (2013).
Tatsuke, D. & Maruyama, O. Sampling strategy for protein complex prediction using cluster size frequency. Gene 518, 152–158 (2013).
Article CAS PubMed Google Scholar
Bohl, K. et al. Evolutionary game theory: molecules as players. Mol Biosyst. 10, 3066–74 (2014).
Article CAS PubMed Google Scholar
Dongen, V. Graph clustering by flow simulation. Ph.D. thesis, University of Utrecht (2000).
Newman, M. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
Article ADS CAS Google Scholar
Gavin, A. et al. Functional organisation of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147 (2002).
Article ADS CAS PubMed Google Scholar
Krogan, N., Cagney, G. & H. Y. et al. Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature 440, 637–643 (2006).
Article ADS CAS PubMed Google Scholar
Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res supl 1, 449–451 (2004).
Guldener, U. Cygd: the comprehensive yeast genome database. Nucleic Acids Res 33, 364–368 (2005).
Article Google Scholar
Ana, L. & Jain, A. Robust data clustering. In Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, II–128–II–133 vol.2 (2003).
Danon, L., Diaz-Guilera, A., Duch, J. & Arenas, A. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008 (2005).
Article Google Scholar
Erdös, P. & Rényi, A. On the strength of connectedness of a random graph. Acta Mathematica Hungarica 12, 261–267 (1961).
Article MathSciNet MATH Google Scholar
Chen, J. & Yuan, B. Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics 22, 2283–2290 (2006).
Article CAS PubMed Google Scholar
Brohee, S. & Helden, J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics (2006).
Kiemer, L., Costa, S., Ueffing, M. & Cesareni, G. Wi-phi: A weighted yeast interactome enriched for direct physical interactions. Proteomics 7, 932–943 (2007).
Article CAS PubMed Google Scholar
Ray, S., De, M. & Mukhopadhyay, A. A multiobjective go based approach to protein complex detection. Procedia Technology 4, 555–560 (2012).
Article Google Scholar
Hossain, S. M. M., Mahboob, Z., Chowdhury, R., Sohel, A. & Ray, S. Protein complex detection in ppi network by identifying mutually exclusive protein-protein interactions. Procedia Computer Science 93, 1054–1060 (2016).
Article Google Scholar
Ray, S., Bandyopadhyay, S., Mukhopadhyay, A. & Maulik, U. Incorporating fuzzy semantic similarity measure in detecting human protein complexes in ppi network: A multiobjective approach. In Fuzzy Systems (FUZZ), 2013 IEEE International Conference on, 1–8 (IEEE, 2013).
Schlicker, A., Domingues, F., Rahnenfuhrer, J. & Lengauer, T. A new measure for functional similarity of gene products based on gene ontology. BMC bioinformatics 7, 302 (2006).
Article PubMed PubMed Central Google Scholar
Pihur, V., Datta, S. & Datta, S. Weighted rank aggregation of cluster validation measures: a monte carlo cross-entropy approach. Bioinformatics 23, 1607–1615 (2007).
Article CAS PubMed Google Scholar

Download references

Author information

Ujjwal Maulik, Srinka Basu and Sumanta Ray contributed equally to this work.

Authors and Affiliations

Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700108, India
Ujjwal Maulik
Department of Engineering and Technological Studies, University of Kalyani, Kalyani, West Bengal, India
Srinka Basu
Department of Computer Science and Engineering, Aliah University, Kolkata, 700156, India
Sumanta Ray

Authors

Ujjwal Maulik
View author publications
You can also search for this author in PubMed Google Scholar
Srinka Basu
View author publications
You can also search for this author in PubMed Google Scholar
Sumanta Ray
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

S.R. and S.B. did the initial planning, implement the code, write the draft. U.M. provided constructive discussions, write and review the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Sumanta Ray.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Maulik, U., Basu, S. & Ray, S. Identifying protein complexes in PPI network using non-cooperative sequential game. Sci Rep 7, 8410 (2017). https://doi.org/10.1038/s41598-017-08760-x

Download citation

Received: 27 February 2017
Accepted: 13 July 2017
Published: 21 August 2017
DOI: https://doi.org/10.1038/s41598-017-08760-x

This article is cited by

Validation and quality assessment of macromolecular structures using complex network analysis
- Jure Pražnikar
- Miloš Tomić
- Dušan Turk
Scientific Reports (2019)
Detection of dynamic protein complexes through Markov Clustering based on Elephant Herd Optimization Approach
- R. Ranjani Rani
- D. Ramyachitra
- A. Brindhadevi
Scientific Reports (2019)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.