Exploring triad-rich substructures by graph-theoretic characterizations in complex networks

One of the most important problems in complex networks is how to detect metadata groups accurately. The main challenge is that traditional structural communities do not always capture the intrinsic features of metadata groups. Motivated by the observation that metadata groups in PPI networks tend to consist of an abundance of interacting triad motifs, we define a 2-club substructure, a substructure with diameter 2 that possesses the triad-rich property, to describe a metadata group. Based on the triad-rich substructure, we design DIVANC, a DIVision Algorithm using our proposed edge Niche Centrality, to detect metadata groups effectively in complex networks. We also extend DIVANC to detect overlapping metadata groups by proposing a simple 2-hop overlapping strategy. To verify the effectiveness of triad-rich substructures, we compare DIVANC with existing algorithms on PPI networks, LFR synthetic networks and football networks. The experimental results show that DIVANC significantly outperforms most other algorithms and, in particular, can detect sparse metadata groups.


Introduction
One of the most important problems in complex networks is how to detect metadata groups accurately [1]. Metadata groups are subsets of vertices that carry real physical meaning. For example, in biological networks they correspond to various biological functional modules such as protein complexes, GO terms and pathways; in social networks, metadata groups may be various social circles, such as groups of people with common interests. Traditional structural communities, which are typically described explicitly or implicitly as dense subgraphs (subnetworks), are usually used to capture the intuition of metadata groups. The underlying assumption is that objects in a metadata group tend to interact more frequently than objects in other regions of the network. To address the problem of detecting metadata groups, scholars have proposed many popular structural community detection algorithms, which can identify some metadata groups successfully to a certain degree. Examples of algorithms that detect metadata groups as dense subnetworks include (i) random-walk based methods such as MCL [2] and INFOMAP [3]; (ii) seed-growing methods such as MCODE [4] and ClusterOne [5]; (iii) algorithms based on clustering, optimization, or statistical techniques such as LinkComm [6], LOUVAIN [7], and OSLOM [8]; and (iv) algorithms based on deeper graph-theoretic features such as EPCA [9][10][11].
While detecting traditional structural communities can offer some insight into the structure of metadata groups, a growing number of recent studies show that these intuitions about traditional structural communities are unreliable [12][13][14][15][16]. Examples include the following: perhaps most profoundly, metadata groups do not necessarily coincide with traditional structural communities [13][14][15][16]; overlapping communities have a higher density of links in their overlapping parts than in their non-overlapping parts, which contrasts with the common picture of traditional structural communities [16]; and, paradoxically, detecting well-defined communities is more difficult than identifying ill-defined communities [12]. All of this counterintuitive evidence hints at the necessity of modifying the general defining characteristics of traditional structural communities. While there is a general consensus that the notion of community or cluster needs to be adjusted, there is no clear direction toward a remedy. The authors of [15] point out that there are two possible scenarios for filling the gap between traditional structural communities and metadata groups. One is to include additional topological features in refining the definitions of traditional structural communities, beyond the standard measures of link density, degree correlations, density of loops, etc.; the other is to add requirements based on non-topological knowledge, such as domain-specific background knowledge [17][18][19], for the detection of metadata groups. However, in the former case, solely adjusting the structural conditions may still not yield satisfactory results, as the essence of metadata groups may not be characterized by equivalent topology in all contexts. In the latter case, adding domain-specific background knowledge may be effective in a limited number of cases, but the reliance on rigid domain-specific knowledge makes the resulting algorithms unlikely to exhibit scalability or transferability to other domains.
An ideal paradigm for fully analyzing metadata groups would identify metadata groups via certain specific intrinsic features, combined with a method for capturing deeper domain-specific structure in a general topological framework on which one can further develop algorithms. Here, we develop such a framework by incorporating a novel and more subtle assumption based on graph-theoretic properties of metadata groups, and we design efficient computing procedures to detect non-overlapping and overlapping substructures that have the desired properties. As shown in figures 1(a) and 1(b), for instance, both the 'nuclear origin of replication recognition complex' [20] (dense) and the 'GID complex' [21] (sparse) consist of abundantly interacting triad motifs. More details about the two complexes can be found in Appendix A. Motivated by the observation that metadata groups in a PPI network are either quite dense or quite sparse and tend to consist of an abundance of interacting triad motifs [22][23][24][25], we define a 2-club substructure, a substructure with diameter 2 that possesses the triad-rich property, to describe a metadata group. Based on the triad-rich substructure, we design DIVANC, a DIVision Algorithm using our proposed edge Niche Centrality, to detect metadata groups effectively in complex networks. We also extend DIVANC to detect overlapping metadata groups by proposing a simple 2-hop overlapping strategy. To verify the effectiveness of triad-rich substructures, we compare DIVANC with existing algorithms on PPI networks, LFR synthetic networks and football networks. The experimental results show that DIVANC significantly outperforms most other algorithms and, in particular, can detect sparse metadata groups.
The rest of the paper is organized as follows. In Section 2, we present our framework for detecting candidate metadata groups. After discussing our datasets and providing statistical evidence that motivates and supports our triad-rich assumption about metadata groups in Sections 2.1 and 2.2, we give a formal definition of a 2-club substructure in Section 2.3. In Section 2.4, we discuss the details of our algorithm for 2-club substructure detection, including a new edge-centrality measure specifically designed for 2-club substructures as well as a 2-hop-based strategy for extracting overlapping 2-club substructures. In Section 3, we report and discuss our experimental results. In Section 4, we conclude the paper and give a closing discussion.

The datasets
We apply our framework to PPI networks [26,27], LFR synthetic networks [28,29] and football networks [30,31]. In the following, we give details about these networks and about the six gold standard sets of metadata groups for the PPI networks.
The S. cerevisiae PPI network (SceDIP) is obtained from DIP [26] and the H. sapiens PPI network (HsaHPRD) is extracted from HPRD [27]. For SceDIP, we use the sets from the Munich Information Center for Protein Sequences (MIPS) [32], the Saccharomyces Genome Database (SGD) [33] and S. cerevisiae GO terms (Sce GO term) [34,35] as gold standards. For HsaHPRD, we use the sets from the Human Protein Complex Database with a Complex Quality Index (PCDq) [36], the Comprehensive Resource of Mammalian Protein Complexes (CORUM) [37] and H. sapiens GO terms (Hsa GO term) [34,35] as gold standards. SceDIP consists of 4980 proteins and 22076 interactions; HsaHPRD consists of 9269 proteins and 36917 interactions. The GO term sets do not contain all GO terms, but only the high-level GO terms whose information content is more than 2 [34,35]. The information content of a GO term g is defined as in [34], relative to its 'root', the corresponding root GO term of g in one of the three aspects: molecular function (MF), biological process (BP) or cellular component (CC). In addition, GO terms with fewer than 2 proteins are removed. We also remove the protein complexes and GO terms of which no members appear in the corresponding PPI network. Finally, MIPS contains 203 protein complexes and SGD 305, while PCDq contains 1204 complexes and CORUM 1294. Additionally, there are 1050 terms in Sce GO term and 4457 terms in Hsa GO term.
Here we also give the details of the LFR synthetic networks [28,29]. The parameters of the series of LFR networks are: number of vertices $N = 1000$, average degree $k = 15$, minimum community size $minc = 20$, maximum community size $maxc = 50$, and mixing parameter $mu$ from 0.1 to 1.0 in steps of 0.1. For the overlapping LFR networks, the additional parameters are the number of overlapping vertices $on = 100$ and the number of memberships of the overlapping vertices $om = 2$. The parameters were chosen to follow the examples provided with the original code, which we downloaded from http://santo.fortunato.googlepages.com/inthepress2.
The football network [30,31] represents the games played among college teams during the 2001 football season in the USA. It consists of 115 vertices and 613 edges, representing 115 teams and the 613 games played between them. The 115 teams are grouped into 11 conferences, with a 12th group of independent teams.

Triad-rich substructures: a novel assumption on metadata groups
As mentioned in Section 1, most existing community detection algorithms are (explicitly or implicitly) based on the assumption that metadata groups most likely appear in dense subnetworks. This edge-rich assumption results in two fundamental difficulties that make it hard, if not impossible, to improve the performance of methods that detect metadata groups by extracting dense subnetworks: (1) the requirement for a subnetwork to be highly dense is too strong; and (2) a pure density-based measure cannot distinguish among subnetworks that have different internal structures that may be of physical significance.
To provide further evidence that the rough assumption that metadata groups most likely appear in dense subnetworks is not comprehensive enough, we quantitatively analyze the density distributions of the metadata groups in the six gold standard sets of the corresponding PPI networks. The numbers of elements in the gold standard sets are given in table 1. Figure 2 shows, for each of SGD, MIPS, Sce GO term, PCDq, CORUM and Hsa GO term, the percentage of metadata groups falling into each density range. The ranges are: density 0; greater than 0 but no more than 0.1; greater than 0.1 but no more than 0.2; …; density 1 with size greater than 2; and density 1 with size 2, as described in the legend of figure 2. We report the metadata groups with density 1 and size 2 separately because, although they appear very dense, they are merely paths on two vertices. As figure 2 shows, only a small number of metadata groups in these gold standard sets have high density.
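The density computation behind this analysis can be sketched as follows. This is a minimal illustration, assuming the usual definition of density for an undirected graph, $2m / (n(n-1))$, applied to the subgraph induced by a metadata group; the function name `group_density`, the adjacency-dictionary representation and the choice to ignore group members absent from the network are assumptions made for the example, not details taken from the paper.

```python
from itertools import combinations

def group_density(adj, group):
    """Density of the subgraph induced by `group` in an undirected network.

    adj   : dict mapping each vertex to the set of its neighbours
    group : iterable of vertices (one metadata group)
    Members that do not occur in the network are ignored (an assumption).
    Returns None when fewer than 2 members are present.
    """
    members = [v for v in set(group) if v in adj]
    n = len(members)
    if n < 2:
        return None
    # Count the edges whose two endpoints both lie inside the group.
    m = sum(1 for u, v in combinations(members, 2) if v in adj[u])
    return 2.0 * m / (n * (n - 1))

# A star with centre 0: sparse as a whole, empty among its leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
whole = group_density(star, [0, 1, 2, 3])
leaves = group_density(star, [1, 2, 3])
```

A group of size 2 whose members are adjacent has density 1 under this definition, which is why such groups are reported separately in the analysis above.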
Motivated by the observation that metadata groups (either dense or sparse) in a PPI network tend to consist of abundantly interacting triad motifs [22][23][24][25], we propose that a metadata group detection method should be based on the following triad-rich assumption: metadata groups are most likely to occur in substructures that contain many interacting triads. A motif in a complex network is a pattern of subnetworks on a small number of vertices that occurs at a significantly higher frequency than expected in a random network with similar network statistics. In this paper, we focus on the most basic building block, the triad motif [38], which consists of three vertices and two links, as depicted in figure 1(c). This is because (i) a two-vertex motif is nothing but an edge, and is trivial; and (ii) motifs containing more than three vertices can be constructed from interacting triad motifs.
The triad-rich assumption naturally generalizes the edge-rich assumption in that a dense subnetwork is triad-rich. For example, the 'Nuclear origin of replication recognition complex' shown in figure 1(a) is a clique, which has the largest possible density, and it is not hard to see that the complex contains many interacting triad motifs. However, a triad-rich subnetwork is not necessarily edge-rich, which makes it possible to detect metadata groups that are not dense. For example, the 'GID complex' shown in figure 1(b) is a star subnetwork, which has the lowest density but still contains many interacting triad motifs. To quantify the property of being triad-rich, we impose the requirement that every pair of vertices in a metadata group participates in at least one triadic interaction. This leads to the graph-theoretic definition of a triad-rich metadata group as a substructure with diameter 2 and the triad-rich property (i.e., a 2-club). This definition of a triad-rich substructure makes it possible to study interesting internal structures of metadata groups that cannot be distinguished by any density-based measure.
To further support the proposed triad-rich assumption that metadata groups are more likely to occur in substructures that contain many interacting triads, we give a basic statistical analysis of the distribution of triad motifs in metadata groups on PPI networks. Using the six gold standard sets described in Section 2.1 as metadata groups, we compare the frequencies of triad motifs in metadata groups with those in randomly selected subnetworks of equal size. Our process is as follows. We first count the number of triads in each metadata group. For each metadata group, if it contains n vertices, we randomly choose a set of n vertices in the corresponding PPI network and count the triad motifs among those randomly selected vertices. We repeat this random selection one thousand times and calculate the corresponding average triad number. Thus, for each gold standard set, we obtain a pair of vectors, one giving the triad numbers for the metadata groups and the other giving the average triad numbers for the subnetworks obtained by randomly choosing vertices. The dimension of each vector equals the number of metadata groups in the gold standard set. For instance, for MIPS, we have a vector of 203 values of the true counts of triad motifs in the 203 complexes, along with a vector of triad counts in randomly generated subnetworks of equal size to each of the complexes. To test the statistical significance of the triad distribution in each gold standard set, we calculate the corresponding P-value with a T-test that compares the numbers of triads in the metadata groups with the numbers in the randomly selected equal-sized subnetworks. A lower P-value means a more significant triad distribution in the metadata groups. We display the P-values for MIPS, SGD, PCDq, CORUM, Sce GO term and Hsa GO term in table 1, where it is readily seen that the randomly selected subnetworks have statistically fewer triad motifs than the true benchmarks. Thus, triad motifs are distributed far more densely in metadata groups than in randomly selected subnetworks. This result reinforces our proposed assumption that metadata groups consist of abundantly interacting triad motifs.
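The comparison procedure described above can be sketched with the Python standard library alone. Several choices here are assumptions for illustration only: a triad is counted as a pair of links sharing a vertex inside the induced subgraph (so a triangle contributes three), the test statistic is computed as a paired t statistic, and the names `count_triads`, `triad_vectors` and `paired_t_statistic` are hypothetical. The paper does not state which variant of the T-test it uses.

```python
import random
from statistics import mean, stdev

def count_triads(adj, vertices):
    """Count triads (three vertices joined by two links) in the induced subgraph."""
    inside = set(vertices)
    total = 0
    for v in inside:
        d = len(adj.get(v, set()) & inside)  # degree of v inside the subgraph
        total += d * (d - 1) // 2            # pairs of links that meet at v
    return total

def triad_vectors(adj, groups, trials=1000, seed=0):
    """Return (observed, expected): triad counts of each group and the
    average triad counts of `trials` random vertex sets of the same size."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    observed, expected = [], []
    for group in groups:
        members = [v for v in set(group) if v in adj]
        observed.append(count_triads(adj, members))
        expected.append(mean(
            count_triads(adj, rng.sample(nodes, len(members)))
            for _ in range(trials)
        ))
    return observed, expected

def paired_t_statistic(xs, ys):
    """t statistic of the paired differences xs[i] - ys[i].
    Needs at least two pairs and differences that are not all equal."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)
```

A P-value would then be read from the t distribution with `len(xs) - 1` degrees of freedom, for example with a statistics package; that step is omitted here to keep the sketch dependency-free.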

A graph-theoretic definition of triad-rich substructures
In this section, we introduce the relevant terminology and our graph-theoretic definition of a triad-rich substructure.

Terminologies and concepts in graph theory
A graph $G = (V, E)$ consists of a vertex set $V$ and an edge set $E$. An induced subgraph of a graph is specified by a set of vertices, and all of the edges that exist between those vertices in the network are also part of the induced subgraph [39]. A P4 is an induced subgraph on four ordered vertices that are connected as a simple path [11,39]. The distance between two vertices is the length (i.e., the number of edges) of a shortest path between them. The diameter of a graph is the maximum distance between a pair of vertices.

2-club substructures.
We define a triad-rich substructure to be an induced and connected substructure in which every pair of vertices participates in at least one triadic interaction. It is not hard to see that a substructure is triad-rich if and only if it is an induced subgraph of diameter 2. Noting that a diameter-2 induced subgraph is also known as a 2-club in the social network literature, we shall call our triad-rich substructure a 2-club substructure. In a 2-club substructure, every pair of vertices either forms an edge or is contained in at least one triad motif; in fact, 2-club substructures play the same role in the class of triad-rich subnetworks as cliques do in the class of edge-rich subnetworks.
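The definition can be checked directly with a breadth-first search restricted to the induced subgraph. The sketch below is illustrative only: the names `eccentricity_within` and `is_2club` are hypothetical, and the check accepts any connected induced subgraph whose diameter is at most 2, so that cliques (such as the complex in figure 1(a)) also qualify.

```python
from collections import deque

def eccentricity_within(adj, members, source):
    """Largest BFS distance from `source` inside the subgraph induced by
    `members`; returns None if some member cannot be reached."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in members and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    if len(dist) < len(members):
        return None  # the induced subgraph is disconnected
    return max(dist.values())

def is_2club(adj, vertices):
    """True if the induced subgraph is connected with diameter at most 2."""
    members = set(vertices)
    if not members:
        return False
    for s in members:
        ecc = eccentricity_within(adj, members, s)
        if ecc is None or ecc > 2:
            return False
    return True

star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}   # sparse but triad-rich
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}   # a P4, diameter 3
star_ok = is_2club(star, star)
path_ok = is_2club(path, path)
```

Because distances are measured inside the induced subgraph, the leaves of a star alone do not form a 2-club even though they are pairwise at distance 2 in the whole network.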

DIVANC: a division algorithm for finding 2-club substructures
Given the definition of a triad-rich substructure as a 2-club, a natural algorithmic problem is to delete the minimum number of edges so that each connected component of the resulting network has diameter 2. While we have not been able to give a formal proof, we believe that this problem is NP-hard, like many other edge-deletion problems. In this section, we develop a new centrality measure to approximate the requirement of having diameter 2, and we design an effective and efficient algorithm for detecting both non-overlapping and overlapping 2-club substructures. Our algorithm is an edge-division algorithm that removes edges according to a new edge centrality measure, called the edge niche centrality, specifically designed to capture the properties of 2-club substructures.

Edge niche centrality.
In their seminal work, Girvan and Newman [30] proposed an edge-division algorithm to detect communities by iteratively removing edges with high edge betweenness centrality. One of the issues with the G-N algorithm is the high time complexity of computing the edge betweenness centrality, even though there are polynomial-time algorithms for it. Recently, a few easy-to-compute centrality measures have been proposed and used to design more efficient edge-division algorithms, including the P4 centrality [11], the antitriangle centrality [40], and the edge clustering coefficient [41].
The new edge-centrality measure, edge niche centrality, measures the importance of an edge by taking into consideration the edge's P4 centrality and its embeddedness (revealed by the edge clustering coefficient). In the following, we give the formal definition of our edge niche centrality. Let $G = (V, E)$ be an undirected and unweighted network and let $e_{ij}$ be an arbitrary edge in $G$. The niche centrality of $e_{ij}$ is defined in Equation (1) in terms of the following quantities: the edge P4 centrality of $e_{ij}$, defined as the number of P4s to which $e_{ij}$ belongs, which can be calculated simply with the function IsP4(a, b, c, d) provided in [11]; the number of triangles to which $e_{ij}$ belongs, which represents the embeddedness of $e_{ij}$ (i.e., the number of common neighbors of the vertices $v_i$ and $v_j$); and the degrees $k_i$ and $k_j$ of the vertices $v_i$ and $v_j$. As shown in Equation (1), two factors are considered in the edge niche centrality. The first, the P4 centrality, helps identify edges that participate in many induced paths of length 3. Removing edges with high P4 centrality helps separate vertices that are at distance greater than 2 and are therefore unlikely to be in the same 2-club substructure. The second term distinguishes edges that have similar P4 centrality but different embeddedness. This definition of edge niche centrality gives us a way to quantitatively measure the extent to which an edge is an inter-link or an intra-link: if its niche centrality is large, the edge is more likely to be an inter-link, while if its niche centrality is small, it is more likely to be an intra-link.
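The two ingredients named above, the edge P4 centrality and the embeddedness, can be computed as in the following sketch. It does not reproduce Equation (1): how the two quantities and the degrees are combined into the niche centrality is not shown here. The function names are hypothetical, and `p4_centrality` is a direct enumeration written for this example rather than the IsP4 routine of [11].

```python
def p4_centrality(adj, u, v):
    """Number of induced P4s (induced paths on four vertices) containing edge (u, v).

    adj maps each vertex to the set of its neighbours (undirected network).
    """
    count = 0
    # Case 1: (u, v) is the middle edge of a path a - u - v - d.
    for a in adj[u] - {v}:
        if a in adj[v]:
            continue  # chord a-v: not an induced path
        for d in adj[v] - {u}:
            if d != a and d not in adj[u] and d not in adj[a]:
                count += 1
    # Case 2: (u, v) is an end edge of a path x - y - c - d,
    # with (x, y) equal to (u, v) or (v, u).
    for x, y in ((u, v), (v, u)):
        for c in adj[y] - {x}:
            if c in adj[x]:
                continue  # chord x-c
            for d in adj[c] - {y}:
                if d != x and d not in adj[x] and d not in adj[y]:
                    count += 1
    return count

def embeddedness(adj, u, v):
    """Number of triangles containing edge (u, v), i.e. common neighbours."""
    return len(adj[u] & adj[v])

# On a 5-cycle every edge lies on three induced P4s and on no triangle.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
p4_on_cycle = p4_centrality(cycle5, 0, 1)
```

An edge inside a clique has P4 centrality 0 and high embeddedness, whereas a bridge between two dense regions has high P4 centrality and embeddedness 0, which matches the inter-link versus intra-link reading given above.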

The 2-hop overlapping strategy.
As edge-division algorithms can only detect non-overlapping substructures, we propose a strategy, the 2-hop overlapping strategy, to uncover 2-club substructures that may overlap. Our strategy is inspired by the idea of overlapping communities in [42]. It searches for eligible peripheral vertices and adds them to the non-overlapping substructures to obtain the corresponding overlapping 2-club substructures. The criterion used to add a peripheral vertex is based on its closeness to a 2-club substructure. Formally, for a given non-overlapping 2-club substructure [...], the set of vertices to be added is [...], where [...] and [...]. Note that the new subnetwork on the vertex set [...] substructure. This is because any vertex in [...] simultaneously, since we have [...], where the vertices [...].
The practical effectiveness of DIVANC is reported in Section 3. In the following, we discuss its worst-case complexity. Let $k$ be the average degree of the vertices and $T$ the number of edges removed. If the overlapping step is not performed, the time complexity of DIVANC is [...], where the first term is the time to compute the edge niche centrality for all edges and the second term is the time to remove the $T$ edges. When the overlapping step is performed, the total running time is [...]. Since practical networks usually have a small average degree $k$, and since $T$ is at most the sum of $(|E| - |V| + 1)$ and the number of detected 2-club substructures, the running time of DIVANC is low and the algorithm is efficient. We note that neither $T$ nor the number of detected 2-club substructures is a parameter of the algorithm.
Algorithm DIVANC (only the final steps of the listing survive):
7: For each of the current connected components do
8: Apply the 2-hop overlapping strategy;
9: End for
10: End if
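The non-overlapping division loop can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it uses the plain edge P4 centrality as the ranking (closer to the EPD2 variant than to DIVANC, whose niche centrality formula is not reproduced here), it removes one edge per round from components whose diameter still exceeds 2, and it recomputes everything after each removal. The names `divide`, `components` and `diameter_exceeds_2` are hypothetical, and vertices are assumed to be mutually comparable.

```python
from collections import deque

def p4_centrality(adj, u, v):
    """Number of induced P4s containing edge (u, v)."""
    count = 0
    for a in adj[u] - {v}:                       # (u, v) as middle edge
        if a not in adj[v]:
            count += sum(1 for d in adj[v] - {u}
                         if d != a and d not in adj[u] and d not in adj[a])
    for x, y in ((u, v), (v, u)):                # (u, v) as end edge
        for c in adj[y] - {x}:
            if c not in adj[x]:
                count += sum(1 for d in adj[c] - {y}
                             if d != x and d not in adj[x] and d not in adj[y])
    return count

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            x = queue.popleft()
            for w in adj[x]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def diameter_exceeds_2(adj, comp):
    """True if some pair of vertices in the component is farther apart than 2."""
    for s in comp:
        reach = {s} | adj[s]
        for w in adj[s]:
            reach |= adj[w]                      # everything within two hops
        if len(reach) < len(comp):
            return True
    return False

def divide(adj, centrality=p4_centrality):
    """Remove highest-centrality edges until every component has diameter <= 2."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    removed = []
    while True:
        bad = [c for c in components(adj) if diameter_exceeds_2(adj, c)]
        if not bad:
            return components(adj), removed
        edges = [(u, v) for c in bad for u in c for v in adj[u] if u < v]
        u, v = max(edges, key=lambda e: centrality(adj, *e))
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
```

The loop always terminates: a component with diameter greater than 2 contains a shortest path of length 3, which is an induced P4, so there is always an edge to remove, and each round removes exactly one. Passing a different `centrality` function is how a niche-centrality ranking would be plugged in.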

Experiments and analyses
We report our experimental results and their analyses in this section. In Section 3.1 we compare DIVANC with existing algorithms in the literature on the PPI networks, LFR synthetic networks and football networks to verify the effectiveness of triad-rich substructures. We show the advantage of DIVANC in detecting sparse metadata groups in Section 3.2. In Section 3.3 we test the practical performance of our proposed edge niche centrality and 2-hop overlapping strategy.

Verifying the effectiveness of triad-rich substructures
In this section we compare DIVANC with other widely used reference algorithms to test the effectiveness of triad-rich substructures. For a fair comparison, we select competing non-overlapping and overlapping algorithms separately, since DIVANC can also be extended into an overlapping version, which is temporarily denoted DIVANC' for comparison. In addition to well-known community detection algorithms, we also choose some strong domain-specific algorithms for detecting protein complexes, such as COACH [43] and ClusterOne [5].
Among the non-overlapping algorithms, we downloaded the freely available Cytoscape plugin for MCODE [4] from http://www.cytoscape.org/. We ran INFOMAP [3] using the R package igraph [44]. We obtained the source code for MCL from http://www.micans.org/mcl and the MATLAB version of LOUVAIN [7] from http://perso.uclouvain.be/vincent.blondel/research/louvain.html. EPCA [11] detects cograph communities based on the edge P4 centrality, which is one critical component of the edge niche centrality proposed in this paper; we have its code. In particular, to verify the advantages of edge niche centrality over edge P4 centrality, we construct a special edge-division algorithm that, like DIVANC, detects 2-club substructures but uses edge P4 centrality (temporarily denoted EPD2, for Edge P4 centrality and Diameter 2 stop criterion). Thus, any difference in effectiveness between DIVANC and EPD2 can be attributed to their different edge centralities.
As for the overlapping algorithms, we downloaded the freely available Cytoscape plugin for ClusterOne [5] from http://www.cytoscape.org/. We obtained the executable program for COACH [43] from http://www1.i2r.a-star.edu.sg/~xlli/. We used LinkComm [6] through its R package [45]. For OSLOM, we used its fast version OSLOM2 [8] from http://oslom.org/software.htm. We set all parameters of the competing algorithms to their default values, since their authors report that the algorithms perform best under the default parameter values. Following the construction of the special algorithm EPD2 among the non-overlapping algorithms, we further extend EPD2 into an overlapping version based on the proposed 2-hop overlapping strategy, denoted EPD2'. Introducing EPD2' as a competing algorithm not only provides a further comparison between edge niche centrality and P4 centrality in an overlapping context, but also verifies the portability of the 2-hop overlapping strategy.
The effectiveness of these algorithms is evaluated using a series of indices for protein complex detection and GO term detection. We use the number of matching metadata groups, the cluster-wise sensitivity (Sn), the cluster-wise positive predictive value (PPV), the accuracy score (Acc) and the maximum matching ratio (MMR) [5,43,46] to assess the algorithms in complex detection. The F-measure, the percentage of matched GO terms and MMR are used to assess them in identifying GO terms [34,35]. More details about these indices can be found in Appendix B.

Comparison in detecting protein complexes and GO terms on PPI networks.
The results for detecting protein complexes are summarized in table 3 for the non-overlapping algorithms and in table 4 for the overlapping ones. Among the indices, we pay most attention to three: the number of candidate complexes that match at least one reference complex in the gold standards (NMC), the accuracy score (Acc) and the maximum matching ratio (MMR), which are given in bold in tables 3 and 4. In addition to comparing the algorithms in detecting protein complexes, we also compare their effectiveness in detecting GO terms from SceDIP and HsaHPRD, using the F-measure, the percentage of matched GO terms and the maximum matching ratio. Figures 4(a-c) show these three indices for the non-overlapping algorithms on SceDIP and HsaHPRD, and figures 4(d-f) display the corresponding indices for the overlapping algorithms.
As shown in table 3, among the non-overlapping algorithms, DIVANC has the largest number of matched protein complexes on all the gold standards except PCDq, where it has 377 matched protein complexes, almost equal to the highest number, 378. The maximum matching ratios of DIVANC are the highest on SGD and CORUM, and on MIPS and PCDq they are very close to the highest ones. The accuracy scores of DIVANC are also very close to the highest ones, such as those of MCL, EPCA and EPD2. As shown in table 4, among the overlapping algorithms, DIVANC' has the largest accuracy scores on all the gold standards. DIVANC' has the largest maximum matching ratio on PCDq, while on the other gold standards its maximum matching ratios are lower than those of COACH. The bar plots in figure 4, which illustrate the effectiveness in GO term detection, also clearly show that DIVANC and MCL are competitive among the non-overlapping algorithms, while DIVANC', COACH and LinkComm are competitive among the overlapping algorithms; all of them outperform the others, as in the complex detection results in tables 3 and 4. COACH and LinkComm are more effective than DIVANC' in detecting GO terms because both can obtain highly overlapping candidate metadata groups, whereas DIVANC' can obtain only 2-club substructures that overlap at their periphery. Thus, in future research we should, on the one hand, maintain the unique graph-theoretic characteristics of 2-club substructures and, on the other hand, pay more attention to increasing their extent of overlap.
To give a specific example, we select a simple complex, the CCBL2-HBXIP-RABIF-UTP14A complex, which can be detected perfectly by DIVANC' but not by the other algorithms. As shown in figure 5, the CCBL2-HBXIP-RABIF-UTP14A complex consists of four subunit proteins and is stored in an integrated database of human genes and transcripts, the H-Invitational Database (H-InvD) [47]. The proteins in green are members of the CCBL2-HBXIP-RABIF-UTP14A complex and those in dark red are not. We emphasize that among all the non-overlapping and overlapping algorithms, only DIVANC' detects the CCBL2-HBXIP-RABIF-UTP14A complex perfectly, while MCODE, INFOMAP, LinkComm and OSLOM2 fail to detect any meaningful candidate complex, let alone one that matches the benchmark perfectly. More details are given in table C1.

Comparing the algorithms on LFR synthetic networks.
Although the triad-rich assumption about metadata groups, which is the foundation of our 2-club substructure framework, was observed in PPI networks, we emphasize that both DIVANC and DIVANC' can work well on general complex networks. In the following we test the scalability of DIVANC to other networks using LFR synthetic networks [28,29]. In these experiments, we use the well-known normalized mutual information (NMI) [48,49] (see Appendix B for details) to evaluate the community detection algorithms.
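For non-overlapping partitions, NMI can be sketched as follows. This is a minimal illustration assuming the common normalization $2\,I(X;Y) / (H(X) + H(Y))$; the exact variants used in [48,49] and Appendix B, in particular the one for overlapping communities, may differ, and the function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information of two non-overlapping partitions.

    labels_a[i] and labels_b[i] are the group labels of vertex i in the
    detected partition and in the reference partition, respectively.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Entropies of the two partitions.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts.
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every vertex in a single group
    return 2 * mi / (h_a + h_b)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to relabelling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

The score depends only on how the vertices are grouped, not on the label names, so relabelled but identical partitions score 1 and independent partitions score 0.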
The LFR test networks include a series of non-overlapping networks and a series of overlapping networks. The parameters for producing them are introduced in Section 2.1. Figure 6 compares the NMI values obtained by the non-overlapping and overlapping algorithms on the synthetic networks. Each point in the figure corresponds to the average NMI value over 20 LFR networks produced with the same parameters. The NMI values of all algorithms decrease as the mixing parameter mu increases, because the community structures of the LFR networks become fuzzier and are thus more difficult to detect correctly. In figure 6(a), the purple line with diamond markers represents the NMI values of DIVANC, and in figure 6(b) the purple line with cross markers represents those of DIVANC'. The results of the other algorithms are indicated by the correspondingly colored lines and markers in figure 6. INFOMAP is the most effective of the compared non-overlapping algorithms, and OSLOM2 has the highest NMI values among the overlapping algorithms. As figure 6 shows, DIVANC is competitive among the non-overlapping algorithms, and DIVANC' has the second highest NMI values among the overlapping algorithms. DIVANC clearly outperforms EPCA [11] and MCODE [4], while DIVANC' is more effective than LinkComm [6], ClusterOne [5], and COACH [43]. The competitive effectiveness of both DIVANC and DIVANC' on LFR synthetic networks shows that the proposed 2-club substructure is suitable for synthetic networks to a certain extent, but it still needs to be improved, since the triad-rich assumption was observed only in PPI networks.

Comparison on football networks.
In this section we also test the algorithms on a small social network, the widely used football network [30,31]. As introduced in Section 2.1, the football network consists of 115 teams and 613 games, and the 115 teams are grouped into 11 conferences, with a 12th group of independent teams (the 8 independent teams have no obvious affiliations, and we place them together in the 12th group for convenience), as shown in figure 7(a). We display the NMI values of the compared non-overlapping and overlapping algorithms in table 5. DIVANC and DIVANC' obtain the same result. The NMI value of DIVANC is very close to the highest one, that of LOUVAIN, among the non-overlapping algorithms, while among the overlapping algorithms DIVANC' achieves the highest NMI value. DIVANC obtains 12 2-club substructures after removing 190 edges. Surprisingly, these 12 2-club substructures match the 12 real football conferences almost perfectly, as shown in figure 7(b). Apart from three of the 8 independent teams (shown as green triangles in figure 7(a)), which are misassigned simply because they are independent teams without obvious affiliations, all the remaining teams match the real groups perfectly. As shown in figure 7(b), two independent teams, Navy and Notre Dame, are assigned to the green circle group, and another independent team, Connecticut, is placed in the red triangle group. Notably, DIVANC is markedly more effective than EPCA [11]: there are no isolated vertices among the obtained 2-club substructures, it deletes 190 edges, far fewer than the 290 deleted by EPCA, and it misassigns only three teams, fewer than EPCA. Our algorithm is thus also effective on the football network.

The advantage for detecting sparse metadata groups
Beyond the macroscopic comparisons above, which are based on various indices, in this section we show the advantage of DIVANC in detecting sparse metadata groups, an advantage that stems from the triad-rich assumption underpinning our definition of 2-club substructures. We list the details of 4 sparse benchmarks and the corresponding candidate metadata groups detected by the non-overlapping and overlapping algorithms in tables C2-C9 (Appendix C), and show them in figures 8-11. As described in tables C2 and C3, the density of the GABAA receptor complex on HsaHPRD is 0, so detecting its candidate complexes is a real challenge, especially for density-based algorithms. Density-based algorithms such as MCODE, COACH, ClusterOne and LinkComm cannot obtain any useful candidate that shares proteins with the GABAA receptor complex. Among all the algorithms, DIVANC and DIVANC' detect the candidate complexes with the highest neighborhood affinity scores (Appendix B) with respect to the benchmark. We also list the details of the DGCR6L-ZNF193-ZNF232-ZNF446-ZNF446 complex in tables C4 and C5 and display them in figure 9, those of the eEF-1 complex in tables C6 and C7 and figure 10, and those of the 116th complex of the MIPS golden standard in tables C8 and C9 and figure 11. The superior effectiveness of DIVANC and DIVANC' in detecting sparse metadata groups again confirms the value of developing algorithms based on triad-rich substructures.
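To make the notion of a density-0 benchmark concrete, the sketch below computes the density of the subgraph induced by a group, assuming the usual definition 2m / (n(n-1)) for a simple undirected graph with n vertices and m internal edges. The definition, the function name and the toy data are assumptions made for illustration; the tables in Appendix C remain the reference for the reported densities.

```python
def induced_density(adj, members):
    """Density of the subgraph induced by members in a simple undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Each internal edge is seen once from each endpoint, hence the division by 2.
    m = sum(1 for u in members for v in adj.get(u, ()) if v in members) / 2
    return 2 * m / (n * (n - 1))


# Hypothetical toy network: a triangle x-y-z, and three proteins p, q, r that
# only interact with an outside hub h, never with each other.
adj = {
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
    "h": {"p", "q", "r"}, "p": {"h"}, "q": {"h"}, "r": {"h"},
}
dense_group = induced_density(adj, {"x", "y", "z"})   # all 3 possible edges present
sparse_group = induced_density(adj, {"p", "q", "r"})  # no internal edges at all
```

The group {p, q, r} has no internal edges, so a purely density-driven search has nothing to grow from, yet its members are all within two hops of each other through h.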

Testing practical performance of edge niche centrality and 2-hop overlapping strategy

Practical performance of edge niche centrality.
Edge centralities play an important role in edge division algorithms. In this section we therefore compare edge niche centrality with edge P4 centrality, on which it is based, to show its advantages. In practice, the two centralities can be compared by comparing the algorithms built on them. We compare DIVANC with EPD2. As introduced above, EPD2 is a special edge division algorithm that combines edge P4 centrality with the diameter-2 stop criterion; it was put together solely to compare the practical performance of edge niche centrality with that of edge P4 centrality in detecting 2-club substructures. As displayed in tables 3 and 4, DIVANC detects 2151 candidate metadata groups and EPD2 obtains 1942 on HsaHPRD, while DIVANC detects 1128 and EPD2 obtains 1015 on SceDIP. Besides reporting the respective indices of DIVANC and EPD2 in tables 3 and 4, we also examine the differences between them in this section. As elsewhere in this paper, we say that a detected candidate metadata group matches a golden standard complex or term if their neighborhood affinity score (Appendix B) is greater than or equal to 0.2. On HsaHPRD, 249 candidate metadata groups detected by DIVANC cannot match any of the 1942 obtained by EPD2. In other words, 249 candidate metadata groups detected by DIVANC on HsaHPRD cannot be obtained by EPD2; likewise, on SceDIP DIVANC detects 152 that cannot be obtained by EPD2. Surprisingly, 71 of the 249 2-club substructures on HsaHPRD match at least one metadata group, as do 16 of the 152 on SceDIP. Moreover, some of the candidates detected by DIVANC but not by EPD2 even match at least one metadata group perfectly.
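The counting described above can be sketched as follows. The sketch assumes the commonly used form of the neighborhood affinity score, NA(A, B) = |A ∩ B|^2 / (|A| · |B|); the authoritative definition is equation (B.1) in Appendix B. It also assumes that the same 0.2 threshold is applied when comparing the candidate sets of two algorithms. The function names and the gene identifiers are hypothetical.

```python
def neighborhood_affinity(a, b):
    """Assumed form of the score: |A & B|^2 / (|A| * |B|); see equation (B.1)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def unmatched(candidates, others, threshold=0.2):
    """Candidates whose score against every group in others is below threshold."""
    return [
        c for c in candidates
        if all(neighborhood_affinity(c, o) < threshold for o in others)
    ]


# Hypothetical toy data standing in for the outputs of two algorithms.
from_first = [{"g1", "g2", "g3"}, {"g7", "g8"}]
from_second = [{"g1", "g2", "g4"}, {"g8", "g9", "g10", "g11", "g12", "g13"}]

only_first = unmatched(from_first, from_second)
```

Here the first group scores 4/9 against its counterpart and counts as matched, while the second scores at most 1/12 and is reported as found by the first algorithm only. The same function with a golden standard set as `others` gives the candidates that match no benchmark.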
Figure 12 displays 8 candidate metadata groups from HsaHPRD that are detected by DIVANC but not by EPD2, together with their matched benchmarks; table C10 gives more details. Figure 13 displays 4 such candidate metadata groups from SceDIP, with more details in table C11. In addition to the official names and gene members of the benchmarks, the tables list the neighborhood affinity scores between the detected candidate metadata groups and their benchmarks. In the columns 'Candidate metadata groups' and 'Benchmark genes', bold font marks the genes common to the candidate metadata group and the benchmark. In figures 12 and 13, each benchmark consists of the genes inside the area circled by a dotted line, and each candidate metadata group is the component formed by the red and green genes together. Within the circled areas, the green genes are those shared by the candidate metadata group and the benchmark, whereas the bright blue, yellow and purple genes are those not detected by DIVANC. Notably, the yellow genes do not belong to the PPI networks used, because the datasets are currently incomplete, and the bright blue genes are isolated proteins in the PPI networks, so neither can ever be detected by any algorithm. Only the purple genes are actually missed by DIVANC. As figures 12 and 13 show, the genes of a benchmark do not always form a connected subnetwork, again because the current PPI networks are incomplete. The fact that metadata groups are not always connected subnetworks of the network, let alone dense ones, is precisely what makes metadata group detection challenging. In short, these practical comparisons between DIVANC and EPD2 reveal a clear advantage of edge niche centrality over edge P4 centrality.

Performance of 2-hop overlapping strategy.
The algorithm DIVANC can be extended into the overlapping version DIVANC' by the proposed 2-hop overlapping strategy. In this section we test the performance of the 2-hop overlapping strategy in detail, since it is what detects overlapping 2-club substructures. As described in tables 3 and 4, DIVANC' performs better than DIVANC overall on both HsaHPRD and SceDIP, which justifies the value of the 2-hop overlapping strategy. The strategy not only produces new candidate metadata groups that match metadata groups, but also improves the matching level between detected candidate metadata groups and their benchmarks, and even makes some candidates match their benchmarks perfectly. Figure 14 shows 6 candidate metadata groups detected by DIVANC that the 2-hop overlapping strategy further improves so that they match their benchmarks perfectly. The green genes are those detected by DIVANC, and the red genes are those added by the 2-hop overlapping strategy. The overlapping algorithm DIVANC' thus detects candidates consisting of the green and red genes together. As the neighborhood affinity scores in table C12 show, the candidates detected by DIVANC' match their benchmarks perfectly. Although we demonstrate the performance of the 2-hop overlapping strategy mainly by comparing the results of DIVANC and DIVANC', the improvement of EPD2' over EPD2 in tables 3 and 4 confirms its effect from another point of view. This improvement also shows that the proposed 2-hop overlapping strategy is portable and can be used to turn other non-overlapping algorithms into overlapping ones.

Figure 14. Illustration of the metadata groups detected by DIVANC' that match their benchmarks perfectly. Figures 14(a-f) show the 6 metadata groups listed in table C12, in which the red triangle proteins are those found by the 2-hop overlapping strategy and, together with the non-overlapping green circle proteins, match their benchmarks perfectly.

Conclusions and discussion
In this work, we aim to overcome the challenge that traditional structural community definitions cannot comprehensively characterize the intrinsic features of metadata groups. We develop a new framework by introducing the novel assumption of triad-rich substructures, defining 2-club substructures, and designing the algorithm DIVANC to detect non-overlapping and overlapping candidate metadata groups that have the desired graph-theoretic properties. To verify the effectiveness of triad-rich substructures, we compare DIVANC with existing algorithms on PPI networks, LFR synthetic networks and football networks. The experimental results show that DIVANC outperforms most other existing algorithms significantly and, in particular, can detect sparse metadata groups.
In future work, we will explore possible applications of 2-club subclasses in complex networks from the viewpoint of graph theory, since 2-club substructures have interesting internal structures.
Since Sn can reach its maximum by grouping all proteins in one complex, whereas PPV can be maximized by putting each protein in its own complex, we use their geometric mean

$$\mathrm{Acc} = \sqrt{\mathrm{Sn} \times \mathrm{PPV}}, \qquad \text{(B.2)}$$

as 'accuracy' to balance these two indices [43,46]; higher Acc scores mean better results.

F-measure
To investigate the performance of the competing algorithms in detecting GO terms, we compute the F-measure [50]. The F-measure ($F$) is the harmonic mean of precision and recall, thus we have

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

Percentage of matched GO terms
The percentage of matched GO terms is the percentage of GO terms that are correctly matched to at least one of the identified candidate GO terms [34,35].
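The two indices above can be computed from a matrix of matching scores, as in the following minimal sketch. It assumes that `scores[i][j]` holds the precomputed neighborhood affinity between reference term i and candidate j, that a pair matches when the score is at least 0.2, that recall is the fraction of reference terms matched by at least one candidate, and that precision is the fraction of candidates matching at least one reference term. These definitions of precision and recall are assumptions made for illustration, and the function name is hypothetical.

```python
def match_statistics(scores, threshold=0.2):
    """Precision, recall, F-measure and percentage of matched references.

    scores[i][j] is the matching score between reference i and candidate j.
    """
    n_ref = len(scores)
    n_cand = len(scores[0]) if scores else 0
    matched_refs = sum(1 for row in scores if any(s >= threshold for s in row))
    matched_cands = sum(
        1 for j in range(n_cand) if any(row[j] >= threshold for row in scores)
    )
    recall = matched_refs / n_ref if n_ref else 0.0
    precision = matched_cands / n_cand if n_cand else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "percent_matched_refs": 100.0 * recall,
    }


# Hypothetical scores for 2 reference terms (rows) and 3 candidates (columns).
stats = match_statistics([[0.5, 0.0, 0.1],
                          [0.0, 0.0, 0.0]])
```

In this toy case one of two references and one of three candidates are matched, so recall is 1/2, precision is 1/3, and their harmonic mean is 0.4.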

Maximum matching ratio
Here we also use a measure called the maximum matching ratio (MMR) [5] to evaluate the algorithms on the detection of protein complexes and GO terms. The MMR builds on maximal matching in a bipartite network, in which the two sets of vertices represent the reference and detected communities, respectively, and an edge connecting a reference community with a detected one is weighted by the neighborhood affinity score introduced in equation (B.1). We select the maximum weighted bipartite matching on this network; that is, we choose a subset of edges such that each detected and each reference community is incident on at most one selected edge and the sum of the weights of these edges is maximal. The chosen edges then represent an optimal assignment between reference and detected communities such that no reference community is assigned to more than one detected community and vice versa. The MMR between the detected and the reference community set is then given by the total weight of the selected edges, divided by the number of reference communities. MMR offers a natural, intuitive way to compare detected communities with a gold standard, and it explicitly penalizes cases in which a reference community is split into two or more parts in the predicted set, as only one of its parts is allowed to match the correct reference community.
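The definition above can be illustrated on a toy example. The sketch below assumes that `weights[i][j]` is the precomputed neighborhood affinity between reference community i and detected community j, and it finds the maximum weighted matching by exhaustive search, which is only practical for a handful of communities. The function name and the weights are hypothetical.

```python
from itertools import permutations


def maximum_matching_ratio(weights):
    """MMR for a small weight matrix, by exhaustive search over assignments.

    weights[i][j] is the non-negative score between reference i and detected j.
    """
    n_ref = len(weights)
    if n_ref == 0:
        return 0.0
    n_det = len(weights[0])
    # Pad with zero-weight dummy columns so every reference can be assigned;
    # being assigned to a dummy column means the reference stays unmatched.
    width = max(n_ref, n_det)
    padded = [list(row) + [0.0] * (width - n_det) for row in weights]
    best = 0.0
    for assignment in permutations(range(width), n_ref):
        total = sum(padded[i][j] for i, j in enumerate(assignment))
        best = max(best, total)
    return best / n_ref


# Reference 0 is split into two detected parts (score 0.5 each); only one of
# them can be selected, and the other is left for reference 1 or unused.
mmr = maximum_matching_ratio([[0.5, 0.5, 0.0],
                              [0.0, 0.2, 0.9]])
```

The optimal selection pairs reference 0 with one of its halves (0.5) and reference 1 with the third detected community (0.9), so the total weight 1.4 is divided by the 2 reference communities. For networks of realistic size the exhaustive search would be replaced by a polynomial-time assignment solver, for example `scipy.optimize.linear_sum_assignment` with `maximize=True`.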

Normalized mutual information
Normalized mutual information (NMI) is widely used for evaluating community detection algorithms. In this paper we use the MGH version of NMI [48] to assess the similarity between the detected results and the golden standards on football networks and the series of LFR synthetic networks. It is defined as

$$\mathrm{NMI}(X:Y) = \frac{I(X:Y)}{\max(H(X), H(Y))},$$

where $I(X:Y)$ is the mutual information and $H(X)$ ($H(Y)$) is the unconditional entropy of cover $X$ ($Y$). More details can be found in the original references [48,49].
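The normalization in this formula can be illustrated on the simpler case of two disjoint partitions. The sketch below is only that: it computes the classical mutual information from label counts and divides it by the larger of the two entropies. It is not the MGH estimator of [48], which is defined on covers and handles overlapping groups. The function names, and the convention of returning 1.0 when both entropies vanish, are assumptions made here.

```python
from collections import Counter
from math import log


def entropy(counts, n):
    """Shannon entropy (natural log) of a distribution given by counts out of n."""
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


def nmi_max(labels_x, labels_y):
    """I(X:Y) / max(H(X), H(Y)) for two disjoint partitions given as label lists."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same vertices")
    n = len(labels_x)
    h_x = entropy(Counter(labels_x).values(), n)
    h_y = entropy(Counter(labels_y).values(), n)
    h_xy = entropy(Counter(zip(labels_x, labels_y)).values(), n)
    mutual_information = h_x + h_y - h_xy
    denominator = max(h_x, h_y)
    # Assumed convention: two trivial one-group partitions count as identical.
    return mutual_information / denominator if denominator > 0 else 1.0


# Same grouping under different label names, then two unrelated groupings.
same = nmi_max([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = nmi_max([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical groupings give a value of 1 regardless of how the groups are named, and statistically independent groupings give 0, which is the range over which the curves in figure 6 should be read.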

Appendix C. Supplementary relative tables for more details
In tables C1-C9, we mainly list the names and member genes of benchmark complexes, the common genes of candidate metadata groups with bold font; we also list the scores of neighborhood affinity, the sizes, density and whether the candidate metadata groups can match the benchmark complexes.

Figure 1. Metadata groups consisting of abundantly interacting triad motifs. (a) Nuclear origin of replication recognition complex; (b) GID complex; (c) an example of a triad motif.

Figure 3. An example demonstrating the 2-hop overlapping strategy. Vertex $v_f$ is the overlapping vertex found by the 2-hop overlapping strategy, which belongs to the substructures $M_1$ and $M_2$.

Figure 4. Bar plots illustrating the effectiveness of the non-overlapping and overlapping algorithms for detecting GO terms. Figures 4(a-c) show the F-measure, percentage of matched GO terms and maximum matching ratio of the non-overlapping algorithms on SceDIP and HsaHPRD, respectively; figures 4(d-f) show the corresponding indices of the overlapping algorithms.

Figure 6. Illustration of the average NMI values of the results obtained by the compared algorithms on the series of LFR networks as mu varies from 0.1 to 1.0 with a step of 0.1. Figure 6(a): the average NMI values of non-overlapping algorithms; figure 6(b): the average NMI values of overlapping algorithms.

Figure 7. Illustration of the real groups and the 2-club substructures obtained by DIVANC on football networks. Figure 7(a): the football network consisting of 12 groups; figure 7(b): the 2-club substructures obtained by DIVANC' after removing 190 edges.

Figure 8. Illustration of the benchmark and the candidate complexes detected by the competing algorithms for the GABAA receptor complex. Figure 8(a): the benchmark of the GABAA receptor complex; figures 8(b-c): the corresponding candidate complexes detected by the non-overlapping algorithms EPCA, EPD2, DIVANC and the overlapping algorithms EPD2' and DIVANC'. The benchmark consists of three isolated bright green proteins; among the detected candidate complexes, the green proteins are members of the GABAA receptor complex and the red ones are not.

Figure 10. Illustration of the benchmark and the candidate complexes detected by the competing algorithms for the eEF-1 complex. Figure 10(a): the benchmark of the eEF-1 complex; figure 10(b): the corresponding candidate complexes detected by DIVANC and DIVANC'. The benchmark consists of 6 proteins, of which the bright blue protein YKR084C is an isolated protein and YBR118W does not belong to the current input PPI networks because of the incompleteness of the datasets; among the detected candidate complexes, the green proteins are members of the eEF-1 complex and the dark red ones are not.

Figure 11. Illustration of the benchmark and the candidate complexes detected by the competing algorithms for the MIPS (116th) complex. Figure 11(a): the benchmark of the MIPS (116th) complex; figures 11(b-e): the corresponding candidate complexes detected by INFOMAP, MCL, LOUVAIN, EPCA, EPD2, DIVANC and EPD2', DIVANC'. The benchmark consists of 9 proteins, of which the protein SNRNA_NME1 does not belong to the input PPI networks because of the incompleteness of the datasets; among the detected candidate complexes, the green proteins are members of the MIPS (116th) complex and the dark red ones are not.

Figure 12. Illustration of the candidate metadata groups from HsaHPRD detected by DIVANC but not by EPD2. Figures 12(a-h): the 8 detected candidate metadata groups (the connected subnetworks consisting of green and red proteins) and the benchmarks in the areas circled by dotted lines, as listed in table C10.

Figure 13. Illustration of the candidate metadata groups from SceDIP detected by DIVANC but not by EPD2. Figures 13(a-d): the 4 detected candidate metadata groups (the connected subnetworks consisting of green and red proteins) and the benchmarks in the areas circled by dotted lines, as listed in table C11.

Table 1. The details of six golden standard sets and their corresponding P-values of triad distribution in PPI networks.

Figure 2. Density distribution among six golden standard sets in PPI networks.

Table 2. The diagram of DIVANC.

Table 3. Comparison with non-overlapping algorithms for detecting complexes from SceDIP and HsaHPRD.
a Net: Networks. b GS: Golden standards. c Alg: Algorithms. d Cov: Numbers of coverage proteins. e NM: Numbers of detected candidate complexes. f AS: Average size of obtained candidate complexes. g NMC: Numbers of candidate complexes which can match at least one reference complex.

Table 4. Comparison with overlapping algorithms for detecting complexes from SceDIP and HsaHPRD.
The footnotes of table 4 are the same as those of table 3.

Table 5. The NMI for effectiveness comparison on football networks.

Table C2. The details about GABAA receptor complex detected by non-overlapping algorithms.

Table C3. The details about GABAA receptor complex detected by overlapping algorithms.

Table C6. The details about eEF-1 complex detected by non-overlapping algorithms.

Table C7. The details about eEF-1 complex detected by overlapping algorithms.

Table C8. The details about MIPS (116th) complex detected by non-overlapping algorithms.

Table C9. The details about MIPS (116th) complex detected by overlapping algorithms.

Table C10. The benchmarks and the corresponding candidate metadata groups detected by DIVANC but not by EPD2 on HsaHPRD.

Table C11. The benchmarks and the corresponding candidate metadata groups detected by DIVANC but not by EPD2 on SceDIP.

Table C12. The candidate metadata groups detected by DIVANC and DIVANC', of which the latter match their benchmarks perfectly.