Cliques and duplication–divergence network growth

A population of complete subgraphs or cliques in a protein network model is studied. The network evolves via duplication and divergence supplemented with linking a certain fraction of target–replica vertex pairs. We derive a clique population distribution, which scales linearly with the size of the network and is in perfect agreement with numerical simulations. Fixing both parameters of the model so that the number of links and abundance of triangles are equal to those observed in the fruitfly protein-binding network, we precisely predict the 4- and 5-clique abundance. In addition, we show that such features as fat-tail degree distribution, various rates of average degree growth and non-averaging, revealed recently for a particular case of a completely asymmetric divergence, are present in a general case of arbitrary divergence.


I. INTRODUCTION
The duplication-divergence mechanism [1,2] of network growth is traditionally used to model protein networks: A duplication of a node is a consequence of the duplication of the corresponding gene, and a divergence or loss of redundant links or functions is a consequence of gene mutations [3,4,5,6].General properties of the duplicationdivergence growth have recently been studied for probably the simplest version of the duplication-divergence model which is the asymmetric divergence [7].Yet even this simplest model, where links are removed with a certain probability only from the replica node, turned out to have very rich phenomenology and to reproduce the degree distribution, observed in real protein-protein networks, surprisingly well.Overall, when the link removal probability is small, the network growth is not self-averaging and an average vertex degree is increasing algebraically.For larger values of the link removal probability, the growth is self-averaging, the average degree increases very slowly or tends to a constant, and a degree distribution has a power-law tail.
A natural next step in exploring properties of the duplication-divergence networks is to consider their modular structure and distribution of various subgraphs or motifs.Small subgraphs are often considered building blocks of network; densities of particular subgraphs may tell if a network belongs to a certain "superfamily" [8] or performs specific functions [9].Abundances of triangles and loops have been studied in the Internet, random and preferential attachment networks and regular scale-free graphs [10,11,12,13].Densities of small motifs and cycles centered on a vertex were considered as a function of the vertex degree and clustering coefficient in [14].In protein-protein networks, highly interconnected subgraphs were found to be well-conserved in evolution [15] and to correspond to functional protein modules in living cells [16].An extreme case of highly interconnected motifs are cliques, or completely connected subgraphs.Cliques have been found in higher than random abundances in protein-protein networks in yeast [16].
In the paper we consider a generalization of duplicationdivergence network growth mechanism, duplicationdivergence-heterodimerization.
The heterodimerization, or linking a certain number of the pairs of target and replica nodes, is essential for clustering and is observed in protein-protein networks [17].We show that duplicationdivergence-heterodimerization produces the cliques in the number very similar to those observed in protein-protein networks.
As in our previous work [7], we again start with the simplest case of the completely asymmetric divergence.Yet in real protein networks, apart from special cases of partially asymmetric divergence [18], the divergence is believed to be close to symmetric [19].It turns out that the asymmetric divergence results for the clique statistics as well as the previously obtained results for the network growth [7] are qualitatively similar to those in the arbitrary divergence case, where links are removed with given probabilities both from the target and replica nodes.
The paper consists of two principle parts: In the next section we derive the clique abundance distribution for the asymmetric case and compare it to the simulation and experimental results.In Sec.III we generalize these and previously obtained results for the network growth and degree distribution onto the arbitrary divergence case.A Discussion and Conclusion section completes the paper.

II. CLIQUES
Protein-protein networks exhibit a distinct modular structure and contain densely linked neighborhoods or complexes ( [16] and references therein).The extreme case of densely linked complexes are cliques or completely connected subgraphs where each vertex is connected to all other subset members.Cliques of the sizes of up to 14 vertices were found in much higher than "random" abundance in protein binding network of yeast [16].Yet many large cliques observed in protein networks may be artifacts of specific experimental techniques or even of misinterpretation of the experimental data.For example, there is a strong evidence that all cliques of order higher than six in the yeast interaction network [21] considered in [16] result from the "matrix" recording of the experimental data from the mass-spectrometry experiments.In such experiments an immunoprecipitation is used to isolate stable protein complexes.Usually a single protein is used as a target for the antibody; binding of the antibody to this protein leads to an isolation of the the entire complex.However, the precise pairwise binding between proteins in the complexes strictly speaking remains undefined if a complex contains more than two proteins.Yet in the "matrix" interpretation of the massspectrometry experiment all possible pairwise interactions between proteins in the complex are usually recorded.A wellknown example of such erroneous recording is the anaphasepromoting complex.It is reported as a 11-node clique in three different mass-spectrometry high-throughput interaction surveys of yeast genome and in the MIPS database [21].The biggest reported clique in yeast network, SAGA/TFIID complex [16], is also the result of erroneous "matrix" recording of the data from a co-immunoprecipitation experiment described in [20].
However, a virtually free of subjective interference twohybrid method, used to determine the protein binding network of fruit fly [22], yields also higher than "random" number of cliques.Specifically, the fly dataset contains 1405 triads, 35 4-cliques and one 5-clique, while a randomly re-wired graph of fly dataset contains only 1147 triads and 8 4-cliques [23].Here and below, the lower-oder cliques that comprise the higher-order ones (each clique with j vertices or "j-clique" consists of j cliques with j − 1 vertices which can be obtained by eliminating one of the j vertices) are counted along with the non-trivial cliques.The number of only non-trivial cliques is slightly lower; the fly dataset contains 1297 non-trivial triads, 30 4-cliques and one 5-clique.
Is such high concentration of large cliques caused by an evolutionary pressure that specifically favors big cliques, or by some stochastic mechanism of network evolution?Evidently, a simple duplication-divergence network growth never produces even a single triad as new duplicates are never linked to their ancestors [7].Random mutations, or re-wiring of some links will give rise to a certain number of cliques, yet their abundance will be much less than the experimentally observed one [16,23].However, in [17] it was concluded that links between paralogs (or recently duplicated pairs of proteins) are significantly more common than if such links appeared by random mutations.Most of these paralogous links are formed when a self-interacting protein or (homodimer) is duplicated [17], thus giving rise to a pair of interacting heterodimers.While after divergence certain pairs of heterodimers loose their ability to interact, some paralogs retain their propensity to heterodimerize.In the following we show that the simple duplication-divergence network growth complimented with heterodimerization of some pairs of duplicates does explain the observed abundance of cliques without invoking any evolutionary pressure.The duplication-divergence-heterodimerization process consists of duplication-divergence, previously introduced in [7], • Duplication.A randomly chosen target node is duplicated, that is its replica is introduced and connected to each neighbor of the target node.
• Divergence.Each link emanating from the replica is activated with probability σ (this mimics link disappearance during divergence).
and heterodimerization, • Heterodimerization.The target and replica nodes are linked with probability P .It mimics the probability that the target node is a dimer and the propensity for dimerization is preserved during divergence.
Similarly to the "pure" duplication-divergence growth [7], the replica is preserved if at least one link is established; otherwise the attempt is considered as a failure and the network does not change.
Let us first consider an evolution of population of triads, or 3-cliques.Two processes that give rise to new triads are illustrated in Figs.1,2.During the first process a target vertex 1, initially linked to the vertices 2 and 3, is duplicated to produce a new vertex 4. The resulting pair of duplicates 1 and 4 have a probability P to be linked.In addition, links 4-2 and 4-3 are inherited with the probability σ each.As a result of this process, two new triads 1-4-2 and 1-4-3 are formed, each with probability P σ.In the second process (Fig. 2) a new triad is produced from the existing one when one of its vertices (vertex 1) is duplicated.The new triad is formed only if both links, 4-2 and 4-3 survive divergence, which happens with the probability σ 2 .
Correspondingly, a rate equation for the increase in the number of triads C 3 per duplication-divergenceheterodimerization step contains two terms, where L and N are the numbers of links and vertices in the network.The fraction 2L/N in the first term is an average number of links picked up for a potential triad (which is also equal to the average degree d ).The factor 3 in the second term indicates that each of the three vertices in the existing triad can be picked up as a target vertex for duplication.Considering links as 2-cliques, the first term in Eq. ( 1) can be interpreted as describing a creation of 3-clique from a lower-order 2-clique.It is easy to see that with such interpretation, the Eq. ( 1) can be generalized to describe the evolution of population of cliques of an arbitrary order, Here ν ≤ 1 is an increment in the number of vertices per duplication step.In the following we focus on a biologicallyrelevant regime of 0 < σ < 1/2 where the average degree d is constant or almost constant [7].In this regime ν = 2σ, and assuming scaling for C j , C j ≡ N c j , one obtains a recurrent relation for the rescaled j-clique abundance, For large j the second term in denominator becomes subdominant, -fruit fly protein-protein binding network, C s j -simulations with σ = 0.38 and P = 0.03, and C th j -Eq.( 3) prediction for the same σ and P .
It follows that the relative population of large cliques decays faster than exponentially.This rends large cliques highly improbable in networks of biologically relevant size of N ∼ 10 4 .
To check this analytical prediction and to see if the proposed duplication-divergence-heterodimerization model explains the observed population of cliques, we performed the following numerical simulation.As in [7], we fix σ = 0.38 so that the average degree is equal to that of the fly dataset, where d ≈ 5.9 for N = 6954 proteins [22].We select P = 0.03 so that the number of triads in the simulated network is also similar to that in the fly dataset and count the number of 4-and 5-cliques in the resulting network.The theoretical C j are computed for the same σ and P taking into account that c 2 ≡ d /2.Results of simulations C s j averaged over 2000 network realizations, the computed C th j , and the clique abundances in the fly dataset C f ly j are shown in Table I.The agreement between the experimental dataset, simulations, and Eq. 3 is surprisingly good, especially given the fact that in for σ = 0.38, d = const only approximately [7].

III. SYMMETRIC DIVERGENCE
In this section we generalize the results obtained in [7] and above for the case of completely asymmetric divergence onto an arbitrary divergence case.The arbitrary divergence model is defined as follows: 1. Duplication.A randomly chosen target node is duplicated, that is, its replica is introduced and connected to all neighbors of the target node.
2. Divergence.Each link emanating from either the target or the replica node is independently removed with probability 1 − σ 1 and 1 − σ 2 , correspondingly.This mimics disappearance of links during divergence from initially indistinguishable target and replica nodes.Vertices that lost all their links during this process (this may include both the target and the replica vertices as well as their neighbors) are discarded.
Unlike in the asymmetric duplication-mutation models, the symmetric growth model may generate network consisting of more than one disconnected components.Vázquez and co-workers [4] investigated a symmetric model which only slightly differs from the fully symmetric version (σ 1 = σ 2 ) of our model.

A. Growth law
As in [7], an increment in the number of links L during a duplication step is, where N is the number of vertices, 2L/N ≡ k is the average number of neighbors or the average degree, and 0 < ν ≤ 1 is an increment in the number of vertices per step.Assuming that for a large network ν does not depend on the network size N , we obtain, As in the asymmetric case, there exist three distinct regimes: • Since at a duplication step the number of vertices cannot increase by more than one, ν ≤ 1 and for σ 1 +σ 2 > 3/2 the growth of L(N ) is superliniear.The average degree grows as a power-law of a network size, and for sufficiently large networks the probability to eliminate all the links and therefore, not to add a vertex at a duplication step becomes negligible.Hence for large networks ν → 1 and • For σ 1 + σ 2 ≤ 3/4 and σ 1 > σ * 1 , σ 2 > σ * 2 (where the lower bounds σ * i will be determined below), we observe that the average degree increases logarithmically and L ∼ N ln(N ).
• Since only linked vertices are counted, the average degree cannot degrease below unity.Hence even for small link retention probability, 1 < σ 1 + σ 2 , and σ 1 < σ * 1 , σ 2 < σ * 2 the growth of L is linear, L ∼ N and the average degree saturates to a constant.

B. Degree distribution
As in [7], the degree distribution N k is described by the following rate equation, Here the first three terms describe the gain of two new degrees of the duplicated vertices and the loss of an old degree of the target vertex, while the fourth term accounts for a change in the number of degrees of a neighbor of a target vertex.Substituting N k ∝ N k −γ and using ν = 2(σ 1 + σ 2 − 1) (which follows from ( 5)), we obtain This equation has a trivial γ ′ = 2 and a non-trivial solution γ(σ 1 , σ 2 ) which intersect at (σ * 1 , σ * 2 ) that satisfy the equation, 0.36879 [7].The resulting exponent γ for the symmetric case is plotted in Fig. 5.The measured in simulation degree distribution for 1/2 < σ < σ * indeed follows the predicted powerlaw asymptotics, Fig. 6.A summary of results for the arbitrary-symmetric duplication-divergence is presented in Table II.

C. Cliques
Similarly to the asymmetric divergence considered above, to generate cliques one needs to add heterodimerization to the pure duplication and divergence.Hence we assume that a target and a replica nodes are linked with probability P .
A generalization of Eq. ( 2) reads Since a creation of a new clique requires that all links emanating both from the target and replica vertices survive divergence, in the first two terms σ is replaced by σ 1 σ 2 .the third term accounts for loss of j-cliques due to disappearance of at least one link both from the target and replica nodes.Following the procedure for the asymmetric case and taking into account that in the scaling regime where , we obtain the recurrent relation (an analog of Eq. ( 3)), We check this prediction for a completely symmetric case σ 1 = σ 2 = σ, again using the fly dataset [22] for reference.The correct average degree and number of triads are obtained when σ ≈ 0.725 and P ≈ 0.0475.The experimental, simulation, and theoretical results, shown in Table III, are again in very good agreement.

D. Integrity of the network
For symmetric divergence, we measure the number of components and the size of the largest component for the networks grown with various σ 1 = σ 2 = σ.The results for the networks of the size of fruit fly dataset, N = 6954, are presented in Table IV.
It follows that for 1/2 < σ σ * the grown network consists of many fairly small components, while for σ * < σ there is usually one or few large components and several small ones.Intuitively it is clear that if the average degree grows, even slowly, the probability to split the network into many parts becomes small.
A theoretical prediction for the size of the giant component exists only for the Erdős-Rényi random graph [24]: When the average degree scales logarithmically with the number of vertices, i.e., d = p ln N , the total number of vertices that do not belong to the giant component scales as N 1−p for p < 1, while for p > 1 the giant component engulfs the entire system.It turns out that for the same number of vertices and links, the completely random linking of the Erdős-Rényi graph keeps more vertices in a giant component than the corresponding duplication-symmetric divergence network.Indeed, for the parameters corresponding to the fly dataset, σ 1 = σ 2 = 0.725, N = 6954, d ≈ 5.9, and p = 0.667, the number of vertices not belonging to the giant component is 6954 × 0.08 ≈ 556 (see Table IV).Yet the Erdős-Rényi graph with the same number of vertices and links has only ∼ 6954 0.333 ≈ 19 vertices outside of its giant component.This happens mainly because in our duplication-divergence growth model, once a component is split from the giant component, it never re-connects.
TABLE II: The behavior of the duplication-divergence network of arbitrary symmetry for different values of probabilities to preserve a link σ1 and σ2.Here L(N ) is the average number of links for given number of nodes N , n k the average fraction of nodes of degree k. σ * i , i = 1, 2 are the solutions of Eq. ( 10), γ(σ1, σ2) is given by Eq. ( 9).If such separation happens at an early stage of the network growth, the separated component may grow to a significant size, thus leaving many vertices outside of the giant component.On contrary, at each step of the Erdős-Rényi growth, any two components can be united with a random link.This makes the co-existence of two or more large components very unprobable.

IV. DISCUSSION AND CONCLUSION
In the previous sections the following conclusions on the clique abundances and growth laws of the duplicationdivergence-heterodimerization networks have been made: • We showed that the duplication-divergence network growth model, complimented with heterodimerization links between duplicates, correctly describes the statistics of cliques in biologically observed proteinprotein networks.
We derive an expression for clique population distribution that correctly describes the clique abundances in the duplication-divergenceheterodimerization networks.
• Generalizing the results obtained for the completely asymmetric divergence in [7], we demonstrated that similar regimes, such as presence and lack of selfaveraging, growth and saturation of the average degree, scaling and fat tail in the degree distribution, exist in general duplication-divergence case as well.In addition, a clique density distribution is generalized onto the arbitrary divergence scenario.
The heterodimerization links are not taken into account in our description of the network growth and degree distribution.
Despite their crucial role in the network topology and clique formation, they constitute only about 1% of all links and do not contribute significantly to the degrees of the most of the vertices.For link inheritance and heterodimerization probabilities σ and P , corresponding to the fly dataset, the resulting number of heterodimeric links in a network of the size of the fly dataset is L hd ≈ P N/(2σ) ≈ 270.This is somewhat higher than the observed number of links between the pairs of recently duplicated (paralogous) proteins L f ly hd = 142 [17].The main reason for this discrepancy is that in our simulation all heterodimeric links are counted, while in the real protein network one can reliably identify only the pairs of recently duplicated proteins.
In a case of not completely asymmetric divergence when links can disappear both from the target and replica nodes, a network may fragment into several components.Yet the biological protein networks are believed to be connected to ensure their functionality.Hence during in vivo divergence the steps that lead to breaking the network into isolated components are excluded due to evolutionary pressure.Our probabilistic network growth model does not take any evolutionary pressure into account.However, since for sufficiently high link retention probabilities the resulting network consists of one or very few large components, the number of link eliminations that have to be evolutionally overridden is small.Hence most of the properties of the probabilistically grown graphs should be similar to those of the realistic evolutionary singlecomponent networks.As the link inheritance probabilities σ i decrease and the number of network components grow, the number of link removal steps that have to be evolutionally overridden becomes large.Consequently, the probabilistic multi-component network becomes less similar to the real single-components one.
As we mentioned in the Section II, we selected the fly dataset as an example as being the most non-subjective one.Other know protein-protein networks, such as for Yeast, Worm, and Human, do contain parts of data that are results of the "matrix" recording of the experimental data from the immunoprecipitation experiments.These datasets contain a higher number of large cliques which can be attributed to this data interpretation.In principle, the clique population distribution derived here can be used to verify and filter the experimental datasets, removing the erroneously recordered large cliques.
In a recent publication, Middendorf et al [25] compared topological properties of the fly dataset to those of the networks grown by several mechanisms such as different versions of duplication-mutation model and preferential attachment.It was found that a duplication-mutationcomplementation network provides the best fit to the fly dataset.The duplication-mutation-complementation network growth model is very close to the duplication-divergenceheterodimerization model studies here.Complementation is equivalent to heterodimerization, the only difference between two models is in the way the links are deleted during divergence (or mutation): Unlike our model, in [25] each neighbor remains connected to at least one of the two dupli-cates.Thus we confirmed the conclusions made in [25] that practically all considered properties of protein-protein networks are very well described by the duplication-divergenceheterodimerization model.
And finally a few words on the importance of heterodimerization links in clique formation.An alternative to heterodimerization way to connect paralogs is to link them randomly by "mutation" links.In this case the probability to establish a heterodimeric link P has to be replaced by a probability that a mutation link, emanating from a target node, selects the replica node out of N network nodes.This probability is equal to M/N where M is the number of mutation links established at each duplication step.In the example of the fruit fly dataset where P = 0.03 and N = 6954, one needs M = N P = 209 random links at each step to form the correct number of triads and higher cliques.Obviously, the mutation scenario which requires so many additional links is completely ruled out due to, for example, average degree constraint.

FIG. 1 :
FIG.1:A sketch of duplication event when a new triad is formed with a heterodimerization link.Solid lines correspond to the existing links, dotted line is a heterodimerization link, established with the probability P , and dashed lines denote the inherited with probability σ links.

FIG. 2 :
FIG.2:A sketch of duplication event when a new triad is formed by duplicating the existing one.Solid lines correspond to the existing links and dashed lines denote the links, each inherited with the probability σ.

TABLE III :
Number of j-cliques in networks with N = 6954 vertices and L = 20435 links for C f ly

TABLE IV :
Number of components nc and the number of vertices in the largest component normalized by the network size, NL/N , in the duplication-symmetric divergence networks for various σ1 = σ2 = σ.All networks are grown to the fly dataset size, N = 6954; the results are averaged over 1000 realizations.