Abstract

Diseases are closely related to genes, thus indicating that genetic abnormalities may lead to certain diseases. The recognition of disease genes has long been a goal in biology, which may contribute to the improvement of health care and understanding gene functions, pathways, and interactions. However, few large-scale gene-gene association datasets, disease-disease association datasets, and gene-disease association datasets are available. A number of machine learning methods have been used to recognize disease genes based on networks. This paper states the relationship between disease and gene, summarizes the approaches used to recognize disease genes based on network, analyzes the core problems and challenges of the methods, and outlooks future research direction.

1. Introduction

Although the human genome project has been accomplished and has achieved great success, and new methods that verify gene function with high-throughput have been applied, studying genetic problems that induce diseases is still one of the major challenges facing humanity [1]. The traditional gene mapping method is based on family genetic disease. First, genes inducing diseases are located in a chain interval. Most recent studies at recognizing disease gene that involves linkage analysis or association studies have resulted in a genomic interval of 0.5 cm to 10 cm, which contains 300 genes [2, 3]. Second, using the biological experiment method to identify each gene located in a chain interval requires a large number of human resources and capital support [4]. In addition, recognizing disease gene by checking the genes set in the interval is often not possible [5]. However, the study of candidate association works well when using a set of known functional candidate genes, which have a clear biological relationship to the disease [6]. Selecting known functional candidate genes is not easy and is often limited by a good deal of factors. The selection of functional candidate genes and prioritization candidate genes has been one of the keys in recognizing disease genes because several reorganization approaches are based on the functions of these genes.

In recent years, a number of recognizing disease gene approaches and computer tools have been developed through building mathematic models based on functional annotation, sequence-based features, protein interaction, and disease phenotype [716], such as sequence features [15], functional annotation [7, 8, 10, 13], and physical interactions [12, 13, 17]. Based on the above features, an approach to rank candidate disease genes by computing a correlation score that stands for the correlation between genes and diseases has been introduced. However, various factors may affect the association between genes and diseases.

System biology has indicated that diseases with overlapping clinical manifestations are induced by one or more mutations from the same function module [1821]. Researches in biological experiments of human disease and patterns have found that genes causing similar disease phenotype often interact with each other directly or indirectly [2224]. These discoveries have shown that positive correlation exists between disease phenotype and disease gene. Many researchers have proposed disease gene prediction methods based on gene interaction and disease phenotype similarity [7, 2529]. Recently, many approaches making full use of gene interaction and disease similarity have established the gene interaction and disease phenotype similarity network to predict disease genes. Some typical methods based on networks will be introduced in detail in this paper.

2. Datasets

In the field of biological information, construction dataset is the data foundation of all subsequent work. The validity of datasets directly affects the validity and reliability of the learning algorithm and test. Thus, building a dataset is a basic and important preparatory work.

The recognition of disease gene datasets is mainly obtained from two databases: Online Mendelian Inheritance in Man (OMIM), which is a synthesis database [3032], and Human Protein Reference Database (HPRD) [33, 34]. Although none of the datasets from OMIM or HPRD are currently complete, they are comprehensive enough [6].

OMIM has the most abundant information, most extensive resources, most comprehensive, authoritative, and timely human genes and genetic disorders based on knowledge composed to support human genetics research and education and the clinical genetics research. OMIM is daily updated and has free access and acquisition at http://www.omim.org/. In OMIM, each item has a short text summary of a generally determined phenotype or gene and a large number of links to other genetic databases [30]. Datasets of disease phenotype and gene-disease phenotype can be obtained from OMIM. However, the data from OMIM need to be disposed to recognize the disease genes [35].

HPRD is a database which curetted proteomic information suited to human proteins. Even though HPRD is updated relatively slow, it is a full-scale resource for studying the relationship between human diseases and genes [36] and is linked to an outline of human signaling paths. HPRD is also available for free at http://www.hprd.org/ [34]. The dataset of gene interaction can be obtained from HPRD [35].

3. Networks

Most research on recognizing disease genes use networks, including the disease phenotype network, protein-protein interaction network, and gene-disease phenotype network, among others. In this study, we introduce only the most commonly used networks. represents the gene (proteins) interaction network, represents the disease phenotype network, and represents the gene-disease phenotype network [6, 16].(1) is an undirected graph and denotes the gene-gene interaction. is the subset of the gene set, and expresses the interaction of genes with weight. Figure 1(a) shows . In the gene-gene interaction network, the relationship between genes is obtained from the gene-gene relationship database, which is one of the most important databases in the biological information field.(2) is also an undirected graph and denotes the disease phenotype network. is the subset of the disease phenotype set, and represents the similarity of the disease phenotype with weight. Figure 1(b) shows . In the disease phenotype network, the relationship between disease phenotypes is obtained from the phenotype relationship database, which can also be replaced by the disease-disease relationship database.(3), which is an undirected biograph, is a gene-disease phenotype network. is the subset of the gene set, and is the subset of the disease phenotype set. expresses the link between the known gene and the disease phenotype. Figure 1(c) stands for . The association between disease gene and disease phenotype can be obtained from the gene-disease relationship database.

4. Methods

In previous research, various methods, such as CIPHER, RWRH, Prince, Meta-path, Katz, Catapult, Diffusion Kernel [5], and ProDiGe, were used to recognize disease genes. In the current paper, we introduce several types of typical recognition disease gene methods.

4.1. CIPHER

CIPHER [6] is a tool for predicting and prioritizing disease genes. Furthermore, CIPHER is applied to general genetic phenotypes, which do better in the genome-wide scan of disease genes; furthermore, they are extendable for exploring gene cooperatives in complex diseases. CIPHER is based on an assumption that if two genes have the closest connection in the gene interaction, then the two genes can lead to more similar phenotypes. A regression model can be formulated according to this assumption. A score assessing how likely a gene is associated with a specific phenotype is obtained from the regression model. To construct the regression model, the similarity between phenotypes, interaction between proteins and genes, and list of associations between known disease gene and phenotype must be prepared. The next paragraph expresses the procedures of prioritizing disease genes.

For a given query phenotype and candidate genes, CIPHER first combines the gene interaction network, disease phenotype network, and gene-disease phenotype network into a single network. The similarity scores of the query phenotype with all known phenotypes in the disease phenotype network are derived directly from the phenotype network and the topological distances between the candidate genes. All known disease genes in the gene interaction network are counted and grouped on the basis of their phenotypes. The correlation between phenotypes and disease genes is obtained and acts as the concordance score for each candidate gene by using the regression model. Finally, all candidate genes for the query phenotype are ranked in line with the concordance scores. On account of different neighborhood systems, two ways are available to define the topological distance: direct neighbor and shortest path. Thus, there are two versions of CIPHER which are CIPHER-SP and CIPHER-DN [6].

The similarity scores of the query phenotype and all phenotypes in the disease phenotype network are calculated by the following formulation: In the above formulation, is the similarity score of the query phenotype and another phenotype in the disease phenotype network. is the topological distance between the candidate genes and in the gene interaction network. There exit two ways to define the topological distance, , on the basis of how to consider indirect the interaction. One way to define the topological distance is shortest path; is the graph theory shortest path length between genes and in the gene interaction network. The other way to define the topological distance is direct neighbor; is infinity when and are indirect neighbors. indicates all disease genes belonging to phenotype . is a constant and can act as the basal similarity between and other phenotypes whose causative genes are not connected to those of in the gene interaction network. is the coefficient of the regression model and stands for the level of gene contributing to the similarity of the phenotype to any other phenotype . To denote the association between a phenotype and a gene, the following formulation (2) is defined: The following vector is used to denote the similarities between the query phenotype and all phenotypes in the disease phenotype network: .

In the same way, the following vector is used to denote the closeness between the genes and the phenotypes in the disease phenotype network: . Synthesizing Formulas (1) and (2) and two vectors extends Formula (1) to the following form: In this regression model, the concordance score is defined by Formula (4): where and are the covariance and standard deviation, respectively. The candidate genes for the query phenotype are ranked according to the values obtained from Formula (4). If a gene that does not connect to any disease genes exists, then Formula (4) cannot be used and the gene will rank at the tail.

4.2. PRINCE

PRINCE is another approach based on networks for ranking candidate disease genes for a given disease and inferring the complex associations between genes. PRINCE is on account of formulating constraints on the ranking function that involved usage of prior information and its smoothness over the network.

Before using PRINCE, the gene disease composed of the phenotype network (set of gene-disease associations), gene-gene interaction network (set of gene-gene association), and at least a query disease phenotype is prepared. denotes the gene-gene interaction network, where is the set of genes, is the set of interactions, and is a weight function denoting the reliability of each interaction. Given a query disease phenotype, PRINCE ranks all the genes in .

Suppose a gene , the direct neighborhood of gene is denoted by . The prioritization candidate disease gene function is denoted by , and reflects the relevance of to . Another function is defined as prior knowledge function denoted by . In the prior knowledge function, gene is related to ; otherwise, . PRINCE computes function that is smooth over the network. Thus, function is a combination of two conditions: where the parameter weighs the relative importance of gene to gene . is a normalized form of . Formally, a diagonal matrix is defined, and is the sum of row of . is normalized by , which obtains a symmetric matrix. Here, . Formula (5) can be expressed in linear form as follows: where and are viewed as vectors of size . is a matrix whose values are given by . Given that the eigenvalues of are set in [−1, 1], , and the eigenvalues of are positive. In addition, exists.

The above linear system can be solved accurately because an iterative propagation-based algorithm works fast and is guaranteed to converge to the system solution for larger networks. Formula (6) is transferred to an iterative algorithm and is denoted as follows: where . Every node propagates the information received in the previous iteration to its neighbors. Finally, the values obtained from Formula (7) rank all the candidate disease genes for a query disease phenotype.

4.3. RWRH

Random walk with restart on heterogeneous network (RWRH) is extended from the random walk with restart algorithm to the heterogeneous network. The heterogeneous network is constructed by connecting the gene-gene interaction network and disease phenotype network by using the gene-disease phenotype relationship information. In brief, the gene-disease phenotype network is the heterogeneous network. RWRH prioritizes the genes and the phenotypes simultaneously, which is inspired by the coranking framework [37]. Given a query disease, seed nodes as genes and phenotypes are associated with the disease, and the top ranked phenotype is the most similar to the query disease.

Random walk is defined as an iterative walker transition from its current node to a randomly selected neighbor, starting at a given source node in the network. However, RWRH allows the restart of the walk in every time step at node with probability . is the probability vector at step  0, indicating that it is the initial probability vector with the sum of probabilities equal to 1. Similarly, is the probability vector at step , in which the th element holds the probability of finding the random walker at node at step . The probability vector at step is denoted as follows: where is the transition matrix of the heterogeneous network; is the transition probability from node to node ; is the restart probability in every time step. After several iterations, reaches a steady-state that is obtained by performing the iteration until the change between and falls below 10−10. is the measure of closing to seed nodes. In vector , when , node is more likely to be the seed node than node .

is the transition matrix of the heterogeneous network. In addition, consists of four subnetwork transition networks and is denoted as follows: where is the transition matrix of the gene-gene interaction network, which is the intrasubnetwork of the heterogeneous network. is the transition matrix of the disease phenotype network, which is also the intrasubnetwork of the heterogeneous network. and are the intersubnetwork transition matrixes. Supposing the probability of jumping from gene-gene interaction network to the disease phenotype network is , the reverse is the same. In the gene-gene interaction network, if a node is not connected to the phenotype. If a node is directly linked to the disease phenotype network, then the node will jump to the disease phenotype network with probability . The node will jump to other nodes in the gene-gene interaction network with probability . Thus, the transition probability from to can be denoted as follows: In the same way, the transition probability from to can be denoted as follows: The transition probability from to , which is the element of at the th row and th column, can be denoted as follows: The transition probability from to , which is the element of at the th row and the th column, can be denoted as follows:

In the above four formulations, , , and are the adjacency matrixes for the gene-gene interaction network, disease phenotype network, and gene-disease phenotype network, respectively. The adjacency matrix of the heterogeneous network can be denoted as follows:

The initial probability of the gene-gene interaction network and phenotype network is denoted by and , respectively. The initial probability of the gene network makes the equal probabilities to all the seed nodes in the gene network, with the sum of the probabilities equal to 1. The initial probability of the phenotype network is the same as the gene-gene interaction network. Thus, the initial probability vector of the heterogeneous network is denoted as . In the initial probability vector of the heterogeneous network, the parameter acts as the judge to weight the importance of each subnetwork. When , the importance of the gene-gene interaction network and the disease phenotype network are equal. If , then the importance of the gene-gene interaction network is greater than the disease phenotype network. When , the gene-gene interaction network is more important than the disease phenotype network is. and the transition matrix are substituted into Formula (8). After many iterations, steady-state is denoted as . In this way, the steady probabilities and are used to rank the genes and disease phenotypes. A web server named GeneWanderer, which is a computational method that prioritizes a set of candidate genes according to their probability to become involved in a particular disease or phenotype using HWRH or diffusion kernel, is used.

4.4. Katz

The Katz method is successfully applied to social network link prediction. Predicting the social network link is close to the problem of predicting disease genes. The Katz approach, which is based on a graph, finds the similar nodes for the query nodes in the network [38].

An adjacency matrix is available in an undirected unweighted graph. The Katz approach counts the number of walks of different lengths that connects and . These walks act as the similarity of the two nodes and . is the number of walks of length that connect and . gives a measure of similarity between and . single similarity measure based on the different walk lengths is necessary. The measure is given below, in which is a constant that restrains contributions of longer walks: The above measure is denoted as follows: If , . In this study, setting leads to the well-known Katz method: where is chosen, such that . In the case of the Katz method, the connections in the graph are weighed so that is the strength of the connection between nodes and . For the choice of , the sum over infinitely many path lengths is not necessarily considered. According to the experimental results, small values of    or obtain good performance in the task of recommending similar nodes.

The adjacency matrix of the heterogeneous network is denoted as follows: One of the advantages of Katz is , which can represent the other species if we want to study human disease phenotypes and other species disease phenotypes. Here, and represent human phenotypes and the other species phenotypes, respectively. and indicate gene-disease phenotype association of human and other species, respectively. When an experiment on human is conducted, set and . By synthesizing expressions (18) and (19), we substitute matrix into Formula (17) and obtain the similarity of genes and phenotypes from the similarity matrix.

4.5. CATAPULT

Combining data across species by using positive-unlabeled learning techniques is abbreviated to CATAPULT. And CATAPULT is a supervised machine learning method which uses a biased support vector machine (SVM), where the features are derived from walks in a heterogeneous gene-trait network.

Given a query disease phenotype, a gene is not associated with the query phenotype. Scholars report positive association between genes and phenotypes; however, the negative associations are rarely reported. In the CATAPULT approach, the unlabeled gene-disease phenotype pairs act as negative associations. The characteristics of the dataset are that only the positive associations are known, and the negative associations and a large number of unlabeled gene-disease phenotype pairs as negative associations are unknown. The general idea of CATAPULT is that the examples are not known to be negative. The false positives are not penalized heavily, but the false negatives are penalized heavily.

CATAPULT uses a biased SVM to classify the gene-phenotype pairs of humans with a single training phase. This approach draws a random bootstrap sample of a few unlabeled examples from the set of all unlabeled examples and trains a classifier to classify the bootstrap samples as negatives along with the positive samples. CATAPULT also uses the bagging technique to obtain an aggregate classifier by using positive and unlabeled examples. The algorithm description is shown in Algorithm 1. denotes the number of bootstraps, is the set of positive, denotes the number of examples in , denotes the set of unlabeled gene-disease phenotype pairs, is a penalty for false positives, and is a relatively larger penalty for false negatives. The source code can be downloaded from http://marcottelab.org/index.php/Catapult.

Initialize   , and
For :
Step 1. Draw a bootstrap sample of size   +
Step 2. Train a linear classifier using the positive training examples and as negative examples by solving:
  
  Subject to , 
   , 
   , 
Step 3. For any update:
(i)  
(ii)  
Return ,  .

Before applying any supervised machine learning approach, extracting the features for gene-disease phenotypes is essential. The features are derived from the paths in the heterogeneous network. For a given gene-disease phenotype pair, different walks of the same length and walks of different lengths can be used as features for the gene-disease phenotype pair.

4.6. Meta-Path

The meta-path approach mainly uses the technology of multilabel classification. The multilabel classification method is useful for recognizing disease genes. A gene may exhibit many diseases caused by the gene. In the above example, the gene is an instance, and various diseases are different labels. Given an instance, a large space of all possible label sets may exist, which may be exponential to the number of candidate labels. The frequently used approach to solve the above problem is exploiting correlations among different labels. In the network, exploiting the correlations among different labels denoted by nodes is an advantage.

Meta-path is defined as a sequence of relations in the network. The objects in the network are linked through multiple-type associations. Multiple-type associations help exploit the correlations among different labels for multilabel classification. In recognizing the disease genes, the labels of the genes are diseases, and the labels of the diseases are genes. The explanation of the correlations among genes is summed up in this study.

Given a set of meta-paths among the gene nodes acting as labels, , the meta-path-based label correlations can be used as follows: where denotes the index set of the genes linked to the th gene through the meta-path . denotes the feature vector of node in the input space. denotes the association between a gene and a gene set. The set of all candidate genes is denoted as , and is denoted as . In the same way, given a set of meta-paths among disease phenotype nodes acting as instances, , the meta-path-based label correlations can be used as follows: where denotes the index set of disease phenotypes linked to the th disease phenotype through meta-path .

To perform multilabel collective classification more effectively in heterogeneous information network, both meta-path-based label correlations and meta-path-based instance correlations are performed simultaneously: where the gene is set as the label set, and the disease phenotype is set as the instance set. On the contrary, the disease phenotype is set as the label set, and the gene set is as the instance set.

Some research has proposed algorithms based on multilabel collective classification. We briefly introduce the multilabel collective classification algorithm called PIPL. The algorithm roughly includes the following steps.(1)Meta-path constructions: extracting all nonredundant meta-paths for label correlations and instance correlations.(2)Training initialization: construction of extended training sets for all , by converting each instance to by using the functions in Algorithm 2. Training one classifier on each label by using the extended training sets.(3)Iterative inference: the inference step is an iterative classification algorithm. It updates the testing instance label set predictions and the relational features of label and instance correlations.

(i)  
  For each meta-path :
  (1) Get related labels for node through meta-path .
  (2)
  Return
(ii)  
  For each meta-path :
  (1) Get related labels for node through meta-path .
  (2)
  Return

5. Evaluation Methods

A comprehensive comparison should be conducted among these methods. In the next several paragraphs, we will introduce some of the key comparisons for recognizing the disease genes reported so far.

Cross-validation is the most frequently used approach in evaluating these methods. However, this method is similar to that used in a previous work, which performs leave-one-out. Each of the known gene-disease phenotype associations is taken as a test case, and a set of genes is assigned as the negative control for each test case. In each round of cross-validation, the disease phenotype is held out, a link between the disease phenotype and one of the associated genes is removed, and the link removed gene is added into the test genes. The rank of the test genes is obtained according to the recognizing methods. Several processing approaches are available for the rank, such as the enrichment score, setting a threshold, precision, recall, and receiver-operating characteristic (ROC).

5.1. Enrichment Score

Suppose the number of test genes is 100. If a recognition disease gene method ranks the actual disease gene as the highest and is sequenced first, then an enrichment of 50-fold exists. The formula of the enrichment score is Enrichment = 50/rank.

5.2. ROC Analysis

ROC analysis denotes the true-positive rate (TPR) versus the false-positive rate (FPR) subject to the threshold dividing the prediction classes. The TPR/FPR is the rate of correctly/incorrectly classified samples of all samples classified to the positive class. To evaluate the scores of disease gene predictions, ROC is explained as a plot of the number of the disease genes above the threshold versus the number of the disease genes below the threshold. The area under the ROC curve for each curve is calculated to compare the different curves obtained by the ROC analysis.

5.3. Setting a Threshold

Concordance score is calculated for each test gene. If the true disease gene ranks first based on the concordance score, then the prediction is successful, and precision is used as the proportion of the successful predictions among all predictions. Another evaluation approach is setting a threshold, in which the highest score of all test genes in this case is not less than the threshold. Thus, recall is the fraction of true disease genes predicted among all disease genes.

6. Materials and Results

In the above section, several recognition disease gene methods and evaluation methods have been mentioned. This part introduces the data used and the comparison results.

Figure 2 [17] denotes the comparison of different recognition methods by setting a threshold. In Figure 2, six recognizing methods are shown, including Catapult, Katz, ProDiGe, RWRH, PRINCE, and Degree. HumanNet gene network, which is a part of the OMIM dataset, is the dataset used to compare the different recognition methods. Figure 2 shows that Katz and Catapult do better than the others with the HumanNet gene network and the evaluation method.

RWRH is compared with CIPHER-DN and CIPHER-SP. The evaluation experiment is based on gene network containing 34,364 interactions between 8919 genes, the phenotypic similarity matrix between 5080 phenotype entities calculated by using MimMiner, and 1428 gene-phenotype links between 937 genes and 1216 phenotype entities. The comparison result is denoted by Figure 3 When , , RWRH successfully ranks 814 known disease genes as top 1. The result is denoted by L001 in Figure 3. The column of L002 is the result of removing a known gene-phenotype link and using the phenotype and the rest of the disease genes associated with this phenotype as seed nodes. The identification of disease genes for phenotype from the genome is called ab initio prediction. The ab initio method removes all the links from a phenotype to disease genes and uses the phenotype entity as seed node to run RWRH. If one of the disease genes associated to the phenotype ranks top 1, then the prediction is successful. The result of ab initio is shown in Figure 3. From the L001, L002, and ab initio, RWRH is better than CIPHER-SP and CIPHER-DN. Figure 4 denotes the result of the comparison between RWR and RWRH. Leave-one-out cross-validation is conducted for each disorder. In each cross-validation, a disease gene is selected, and the links between the phenotype entries and the disease gene are removed. The rest of the disease genes and the phenotype entry are used as seed nodes. The selected disease gene and all disease genes in the artificial linkage are ranked by RWRH and RWR. ROC analysis is used to evaluate the two recognizing approaches.

7. Conclusion

Identifying disease genes is one of the fundaments of medical care and has been a goal in biology. Although traditional linkage analysis and modern high-throughput techniques often provide hundreds of disease gene candidates, identifying disease genes in the candidate genes by using the biological experiment method time-consuming and expensive. To deal with the above issues, the methods based on networks have been proposed. Many methods based on network have been created to recognize disease genes. In this paper, five typical algorithms based on networks, namely, CIPHER, PRINCE, RWRH, Katz, and CATAPULT, are introduced in detail.

Some novel methods have been put forward to recognize and prioritize disease genes. For instance, BRIDGE [39] takes advantage of multiple regression models with penalty to automatically weight different data sources. A researcher employed the ensemble boosting learning technique to combine variant computational approaches for gene prioritization to improve overall performance [40].

Biological relationships are showed by networks, which brings forth new ideas. A network can be used to denote the association between genes and disease to recognize the gene-disease phenotype and to obtain a more complete understanding of the biological system. Networks have been successfully used in biology. However, combining experiments with networks results in the challenge of defining node similarities. Different ways to define node similarity may lead to different effects.

With the development of biology and the emergence of a large number of relevant data, disease gene research based on networks constantly matures. New machine learning methods and technologies will be used to predict disease genes. Research on disease gene recognition will achieve new breakthroughs. The disease gene research will open a new era of medical treatment.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work was supported by the Natural Science Foundation of China (no. 61370010, no. 61202011, no. 61271346, no. 61172098, and no. 60932008) and the Ph.D. Programs Foundation of Ministry of Education of China (20120121120039).