Towards Prediction and Prioritization of disease genes by the modularity of human phenome-genome assembled network

Summary Empirical clinical studies on the human interactome and phenome not only illustrates prevalent phenotypic overlap and genetic overlap between diseases, but also reveals a modular organization of the genetic landscape of human disease, provding new opportunities to reduce the complexity in dissecting the phenotype-genotype association. We here introduce a network-module based method towards phenotype-genotype association inference and disease gene identification. This approach incorporates protein-protein interaction network, phenotype similarity network and known phenotype-genotype associations into an assembled network. We then decomposes the resulted network into modules (or communities) wherein we identified and prioritized the disease genes from the candidates within the loci associated with the query disease using a linear regression model and concordance score. For the known phenotype-gene associations in the OMIM database, we used the leave-one-out validation to evaluate the feasibility of our method, and successfully ranked known disease genes at top 1 in 887 out of 1807 cases. Moreover, applying this approach on 850 OMIMloci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.

An recent endeavor [32] in the genome-wide inference of disease genes have shown that a simple linear regression model efficiently capture the underlying architecture of the human interactome and phenome networks and suggests a global concordance of the topology between the phenotype network and the gene network. Although it achieves remarkable success, it is not available to rank candidates and select plausible susceptibility genes for a query disease in a short time due to the huge computational time for a network with large size. As suggested in a recent investigation [22], fortunately, human disease shows a modular organization on the genetic landscape. Hence, an intuitive idea towards the above-mentioned method is clustering disease based on their phenotypic similarities and making predictions for query disease within the same phenotype cluster. Inspired by this idea, we introduce a network module-based method that integrates PPI network, phenotype similarity network and known phenotype-gene associations, and then decomposes the resulted assembled network into modules ( or communities) wherein we identified or prioritized the disease genes from the candidates within the loci associated with the query disease using the linear regression model and concordance score proposed in [32].
We first verify the hypothesis of our method that human disease shows a modular organization on the genetic landscape and the modularity is significantly correlated with disease classification. In addition, to evaluate the feasibility of our method, we used the leave-one-out validation on the known phenotype-genotype associations in OMIM database-Online Mendelian Inheritance in Man database [23], and successfully ranked known disease genes at top 1 in 887 out of 1807 cases. Moreover, applying this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases.

Phenotype similarity network
A phenotype network consisting of disease phenotypes as nodes and phenotypic similarity as edges was constructed by van Driel et al. [30] using the OMIM database [23]. Considering some OMIM records have since been moved to other records, we removed them from the network to avoid technical error. We obtained a network with 4146 phenotypes and 29489 edges using a similarity of 0.5 as the threshold. However, selection of the threshold had no significant effect on the modularity of the disease network (Table 1). In order to link phenotypes in the network to known disease genes, OMIM genotype-phenotype associations for 1807 genes and 2265 disease phenotypes were downloaded on December 2 2008 from Ensembl [10] using the data mining tool BioMART [13].
We used the classification of disease phenotypes which was manually established in [9] to evaluate the significance of modularity in the phenotype network. 2929 phenotypes were classified into 22 primary disorder classes based on the physiological system affected. We did not merge several phenotypes into a single disorder as Goh et al. did [9], and we obtained 1184 phenotypes within 21 disease classes in our phenotype network.

PPI network
We obtain 34364 manually curated PPIs between 8919 human proteins from the HRPD database [27] and call this resulted PPI network "GeneNet1". In order to avoid biased towards betterstudied proteins, we also obtain 33049 predicted human PPIs between 7185 (4116 are absent from HPRD) proteins from the OPHID database [4], which is built by mapping PPIs from high-throughput screen of model organisms to human. The extended protein network, called "GeneNet2", combines HPRD, OPHID and two other curated PPI databases: BIND [1] and MINT [5], yielding a network of 72431 unique pairwise binary interactions between 14433 human proteins.

Computation of dyadicity D and heterophilicity H
Dyadicity (D) and heterophilicity (H) are two network properties of nodes which were recently proposed [25] for quantifying whether nodes with similar characteristics have a tendency to link to each other. We used these parameters to investigate whether phenotypes in a specific physiological class tend to cluster together in our phenotype network. The value of a phenotype depends on whether it belongs (1) or does not belong (0) to a disease class. Thus three types of links between phenotypes exists: 1-1, 1-0, and 0-0; the number of these links are termed m 11 , m 10 and m 00 , respectively. The two parameters dyadicity D and heterophilicity H are defined as: If D ≫ 1 and H ≪ 1, phenotypes in the specific disease class must have a clear clustering tendency within the network.
The expected value ofm 11 andm 10 is computed next. If we take cancer as an example, we can call n 1 the number of phenotypes belonging to cancer and n 0 the number of other phenotypes. N = n 1 + n 0 is the total number of phenotypes and is the total number of edges in the network. Let p := 2M N (N −1) represent the connectance that indicates the average probability that two phenotypes are connected in the network. The value of a phenotype depends on whether it belongs to a cancer class (1), or does not (0). The three varieties of link styles between phenotypes are 1-1, 1-0, and 0-0, and the number of these links can be labeled as m 11 , m 10 and respectively. If any phenotype in the network has an equal chance of being cancer, the expected values of m 11 and m 10 arem 11 andm 10 respectively[25] Statistically significant deviations of m 11 and m 10 from their expected values ofm 11 andm 10 imply that cancer phenotypes are not distributed randomly in the phenotype network.
Dyadicity D > 1 (D < 1) indicates that phenotypes in the disease class tend to connect more (less) densely among themselves than expected for a random configuration. Similarly, heterophilicity H > 1 (H < 1) means that phenotypes in the disease class have more (fewer) connections to phenotypes in other classes than expected randomly. If D ≫ 1 and H ≪ 1, phenotypes in the specific disease class must have a clear clustering tendency within the network.

Extracting the modules of the phenotype network
We detected modules in the phenotype network using the spectral algorithm based on modularity Q defined as (more details, see [20] and reference therein) where m is the number of modules, e ii are the fraction of the edges that connect two nodes inside a module i, and e ij are the fraction of the edges connecting nodes of module i and j.
The modularity Q of a partition is high when the number of intra-module edges is much larger than expected for a random partition. We identified modules by maximizing the modularity Q so that there were many intra-module edges and few between-module edges. However the method could not identify the hierarchical structure of the modules. Therefore, we decomposed all modules which had more than 100 phenotypes into sub-modules.
The number of final modules which are based in the secondary level of modularity may affect the results. We managed to reduce the effect by visually inspecting each sub-network with more than 100 phenotypes in the first level modules while automatically decomposing the phenotype network using our previous algorithm [12].

Computing p-value for the disease class enrichment of modules in the phenotype networks
We then used the disease classification dataset to see if the disease phenotypes within a single module tended to fall within the same disease class. We utilized the method introduced in [31] that computes a p-value for the functional enrichment of modules in PPI networks. Take cancer as an example. For a given module M we randomly selected a set of phenotypes which had the same number of members as M, and counted how many of them are cancer. The p-value was calculated as the probability that the number of cancer phenotypes in a random group would be equal to or greater than what we observed in M. We used 100,000 simulations to obtain the p-values.

Regression model and the concordance score
In each module, we used the regression model proposed in [32]. They assumes that additivity of the contribution to phenotype similarity from different disease genes and is defined as where S pp ′ is the similarity score between a query phenotype p and another phenotype p ′ , and d gg ′ is the topological distance between gene g and g ′ on the PPI network. G(p) denotes all disease genes belonging to the phenotype p. The Gaussian kernel e −d 2 gg ′ is used to transfer gene-gene distance to gene-gene closeness. C p i s a constant, and β pg is the coefficient of this regression model, respectively. C p could be explained as the basal similarity between p and other phenotypes whose causative genes are not connected to those of p in the protein network, and β pg represents the level of the gene g contributing to the similarity of the phenotype p to any other phenotype p ′ . We consider the topological distance d gg ′ as the graph theory SP length between genes g and g ′ in the protein network.
To quantify the association between a phenotype and a gene, we define the closeness of gene g to phenotype p ′ as the summation of gene-gene closeness from gene g to all disease genes of phenotype p ′ , as Hence, equation (1) can be written as where the vectors S p = (S pp 1 , . . . , S ppn ) and Φ g = (Φ gp 1 , . . . , Φ gpn ) are described the similarities between the query phenotype p and all n phenotype in the same module and the closeness between gene g and all these n phenotypes respectively. Thus, in this linear regression model, we define the Pearson correlation coefficient as the concordance score where cov and σ mean covariance and standard deviation, respectively. This concordance score measures the consistency between the position of gene g in the protein network and the variations of phenotype similarity for phenotype p in the whole phenotype network. It is then used to rank all the candidate genes for a specific phenotype.

Benchmark tests
A leave-one-out cross-validation procedure is used to assess the performance of our method. In this procedure, we remove the direct link between true disease gene g and phenotype p, and see if the method can recover this link (rank gene g at the top of the N test genes). This is carried out by taking known disease gene g as unknown when calculating Φ g i p , the closeness from test gene g i to query phenotype p. For phenotypes with more than one known causative genes, we modified the definition of a successful prediction: for a test case (p, g) in which p has k known disease genes, if gene g is among the top k-ranked genes, we consider it a successful prediction.

Modularity of human phenotype network
Visualization of the human phenotype network using 21 disease classes indicated that phenotypes within the same disease class are clustered into densely connected groups (Fig.2). Most of the disease classes are dyadic (D ≫ 1) and heterophobic (H ≪ 1), revealing a highly modular structure ( Fig. 3(a)). However, a few disease classes are heterophilic, suggesting that they tend to have phenotypes that overlap among different categories of diseases. These diseases include developmental, skeletal and 'ear, nose, throat'. For instance, developmental diseases, in which a delay occurs in physical or mental development, tend to overlap with other diseases. This is logical because most developmental disorders would be expected to affect multiple tissues. An interesting observation is that although phenotypes in the 'ear, nose, throat' class have strong heterophilicity with other disease classes, the dyadicity of this class is very large (∼180). This suggests that the 'ear, nose, throat' class may be a densely connected part of the network even though it has many connections to phenotypes in other classes.
We used the spectral algorithm to decompose the phenotype network into modules based on modularity Q. The maximal modularity Q equals 0.78, which indicates a distinctly modular  structure rather than a random network. The network was partitioned into 28 modules in the first partition. In order to identify the hierarchical structure of the modules, we decomposed all modules which had more than 100 phenotypes into sub-modules. If the sub-network of phenotypes in a module had a clear secondary modular structure (Q ≥ 0.5), we used the submodules instead of the first level one. This yields 231 modules, 214 of which were based on the secondary level of modularity (see Supplementary File 1 for more details).

Modularity is significantly correlated with disease classification
The p-value profile for 21 disease classes in each module with more than five phenotypes is demonstrated in Fig.3(b). Almost all modules were significantly enriched with one or two (three of few modules) disease classes when we used 10 −3 or 10 −4 as the cutoff for statistical significance of each class.
van Driel et al. [30] constructed a set of phenotype similarities by text-mining all records that describe genetics disorders in the OMIM database using medical subject headings (MeSH). The nature of the similarity measure ensures that two phenotypes will be connected in the network if they have similar clinical traits. As expected, phenotypes in a disease class tend to group in the network. However, they can be divided into many different modules. For example, phenotypes in the neurological disease class are distributed into about ten modules; of them, one module contains primarily ataxia phenotypes, such as spinocerebellar ataxia and cerebellar ataxia; and one module contains mostly Charcot-Marie-Tooth disease phenotypes. Thus modules generally are subclasses of the primary disease classes. In addition, in several cases, some phenotypes in different disease classes may be grouped together because they have similar clinical traits. These results indicate that the network method can not only provide a computational validation of the disease classification which was determined manually by Goh et al. [9], but also provide a more specific classification of disease phenotypes.

Benchmark test:leave-one-out validation
To examine how well our method reflects the biological truth, we took each of the 1807 known gene-phenotype association as one test case, and for each case we generate an artificial loci [14,26].
For each gene-disease link, we simulate a linkage locus around the true disease gene by including 181 neighboring genes as negative controls to simulate the median size of linkage intervals for OMIM phenotype loci with unknown molecular basis [14](see Supplementary Figure S3 for the distribution of gene numbers in all the disease loci). This strategy for resembling known disease loci in the OMIM database has been widely used in previous studies [14,32]. The 181 test genes are then treated equally by assuming links to the disease under study and go through the network prediction procedure. We then calculate the concordance score for each test gene, and rank the test genes according to the score. If the known disease gene is ranked as top 1, we consider it a successful prediction, and we define the precision as the proportion of successful predictions among all predictions. We set a threshold and only make prediction when the highest score of all test genes in a case is no less than it, and define recall as the fraction of true disease genes predicted among all disease genes [14]. A leave-one-out cross-validation (see Materials and Methods) shows our method can at least rank known disease genes at top 1 in 887 out of 1807 test cases, achieving a precision of 0.49 and a recall of 0.49. With the increasing of the threshold, the precision becomes larger while the recall becomes smaller. For high-scoring candidates, the precision can approach 0.73 while maintaining a high recall of 0.31 (Fig.4). Compared with GeneNet1, the precision using GeneNet2 is smaller, varying from 0.49 to 0.65. The reason is that there are many genes in network GeneNet2 have been not well-studied and their association with disease have been not identified.

Prioritization candidates in disease loci
The above results indicate that our method can efficiently predict the human disease genes from genetic loci. Therefore, we applied our procedure to 850 OMIM phenotype entries with at least one mapped disease locus but unknown molecular basis. We obtained predictions for 81 loci, only top 20 of which are summarized in Table 2. All the predictions are available in  Table S2. We obtained predictions for 81 loci, 67 of which where only from the GeneNet1, 5 only from the GeneNet2 and 9 from both. Interestingly, in 4 of the latter cases, the list of candidates from the two networks contained at least one common gene.
Notably, for three OMIM phenotypes (163000, familial multiple nevi flammei; 268700, saccharopinuria; 300195; AMMECR1) our predictions include the actual disease genes that, although not yet correctly annotated in OMIM, have been found to be mutated in patients.
For 22 loci, at least one of the candidates obtained from either network was already known to be involved in phenotypes similar to those described for the locus. These genes represent the most obvious candidates and our results should be considered as further, independent evidence for their possible involvement in the disease. However, it must be noted that some of them were previously excluded, either by the direct identification of crossovers or by the negative results of mutation screenings. Nevertheless, since mutations have most likely been searched only within the annotated exons, we think that the decision to definitively rule out the involvement of such candidates should be taken cautiously. Moreover, even silent exonic mutations, although often considered innocuous polymorphisms, can have severe effects on proteins by disrupting splicing patterns [24,7]. In most cases only few candidates are given for a locus, thus providing extremely focused working hypotheses for the identification of the actual disease genes, which in many cases are made even stronger by the available sequence or functional information. For instance, one of the two candidates provided for the OMIM phenotype entry 607221 (partial epilepsy with pericentral spikes, located on 4p15) corresponds to KCNIP4 [19]. This protein has been show to specifically modulate the activity of Kv4 A-type potassium channels, which are well known regulators of membrane excitability [2] and have been recently involved in epilepsy [28].
Even when the number of candidates for a particular locus is substantially higher, our results may provide a strong restriction of the experimental search field, which can be further narrowed by additional evidences. For instance, the phenotype with OMIM ID 130080 (Ehlers-Danlos syndrome, type VIII), is mapped to 12p13, containing 277 genes. In this case, the GeneNet1 and GeneNet2 networks provide 8 and 4 candidates, respectively. Interestingly, the candidate with the lowest associated score is the Alpha-2-macroglobulin precursor (A2M), whose absence was previously reported in a patient with Ehlers-Danlos syndrome [17].

Fold enrichment: comparison with other methods
Various methods [14,26,21,8,29,32] have been proposed for prioritizing candidate genes, but few of them have reported the precision within their publications. Traditionally, the power of these methods is measured by their ability to enrich known disease genes over random selection, say, fold enrichment [14]: If a method successfully ranks known disease genes in the top m% of all candidate genes in n% of the linkage intervals, there is a n/m-fold enrichment on average. We compared these methods by computing their fold enrichment (listed in Table  3) that illustrate our method's potential.

Discussions and Conclusions
The success of our method can be attributed to utilizing the modular nature of human genetic disease that paved a new way for phenotype-gene association studies from several aspects. First, there is now good evidence from bioinformatic analysis that human genetic diseases can be clustered on the basis of their phenotypic similarities and that such a clustering represents true biological relationships of the genes involved. Second, one may use such phenotypic similarity to predict and then test for the contribution of apparently unrelated genes to the same functional module. Third, one can use bioinformatics to make predictions about new genes for diseases that form part of the same phenotype cluster. This is done by starting from the known disease genes and then searching for genes that share one or more functional attributes such as gene expression pattern, co-evolution, or gene ontology. Ultimately, one may expect that a modular view of disease genes should help the rapid identification of additional disease genes for multifactorial diseases once the first few contributing genes (or environmental factors) have been reliably identified.
Certainly, our approach can be improved in the following directions. First, our method is limited to genes with known protein interactions (about one-third of the entire human genome). Further expanding the protein network to embrace less reliable protein interactions (such as the OPHID network) or non-physical functional associations may increase the power to detect less-studied disease genes in practice. Second, our method suffers from the imprecision and subjectiveness in quantifying phenotype similarity. The continuing endeavor for standardizing and quantifying phenotypic description would further enhance our method. Third, like other methods for disease gene finding, our method cannot tell where the causative genetic variants are in high-rank genes. With the recent progress in the prioritization of candidate genetic variants for human diseases, it is expected that by prioritizing candidate genes and genetic variants at the same time, the two may benefit each other and facilitate the discovery of disease genes and causative genetic variants therein.
Our method also illustrates well the power of the integration of different types of networks. We suggest that the ongoing large-scale mapping of human interaction networks and systematic collection of human phenotypic data are valuable for biomedical research, and the increasing coverage and quality of human interaction network, as well as more standardized and objective phenotype descriptions will facilitate the discovery of new disease genes.

Detect the modules/submodules of the phenotype network
From a combinatorial point of view, a network is a simple graph, i.e., a pair G = (V, E) consisting of a set V , called its vertex set and a subset E of the set V 2 := {e ∈ V : |e| = 2} called its edge set, associated with a symmetric weight matrix W = (w) uv∈V ∈ ℜ V ×V ≥0 . Further, given a network G, a community structure is a partition Π of V into a disjoint union of nonempty subsets V 1 , . . . , V m whose vertices are, intuitively speaking, "more densely connected" to one another than to the other vertices of G. A famous quantitative measure for evaluating the "goodness of fit" of a partition Π to G, the modularity function Q = Q(Π), was proposed by M.E.J. Newman and M. Girwan in [3] and is represented as

Module ID Size Modularity Module ID Size
It was illustrated [4,1,2] that a high Q-value indicates that the partition Π represents a "good" community structure for G and, so, much work has been devoted in recent years to designing methods proposed towards this end for iteratively improving the Q-measure. The method, however, could not identify the hierarchical structure of the modules and thus we here decomposed all modules which had more than 100 phenotypes into sub-modules. Obviously, the number of final modules depends on the secondary level of modularity that we identified. To reduce the effect, we visually inspect each sub-network with more than 100 phenotypes in the first level modules while automatically decompose the phenotype network using our previous algorithm [2].
The network was partitioned into 28 modules in the first level. We found 16 modules with at least 100 phenotypes, of which 11 modules had a significant secondary level of modularity using Q-measure (Table S1). We scrutinized each sub-network of phenotypes in the modules using the spring-embedded layout in Cytoscap software to check if each sub-network has a clear modular structure. In the resulting spring-embedded layout, nodes with edges between them tend to be situated near each other, whereas nodes without edges between them tend to be spread apart. Finally, we found that the modularity 0.5 is an appropriate threshold value for extracting secondary level modules. Following this way, we identified 231 modules in the end, most of which (214 of 231) are based on the secondary level of modularity. Thus, we believed that this decomposition method will reveal the actual modularity of the phenotype network.

Candidate genes for OMIM loci with unknown molecular basis
We give the predicted candidate genes for 81 OMIM loci with unknown molecular basis in the following