Mimvec: a deep learning approach for analyzing the human phenome

Background The human phenome has been widely used with a variety of genomic data sources in the inference of disease genes. However, most existing methods thus far derive phenotype similarity based on the analysis of biomedical databases by using the traditional term frequency-inverse document frequency (TF-IDF) formulation. This framework, though intuitive, not only ignores semantic relationships between words but also tends to produce high-dimensional vectors, and hence lacks the ability to precisely capture intrinsic semantic characteristics of biomedical documents. To overcome these limitations, we propose a framework called mimvec to analyze the human phenome by making use of the state-of-the-art deep learning technique in natural language processing. Results We converted 24,061 records in the Online Mendelian Inheritance in Man (OMIM) database to low-dimensional vectors using our method. We demonstrated that the vector presentation not only effectively enabled classification of phenotype records against gene ones, but also succeeded in discriminating diseases of different inheritance styles and different mechanisms. We further derived pairwise phenotype similarities between 7988 human inherited diseases using their vector presentations. With a joint analysis of this phenome with multiple genomic data, we showed that phenotype overlap indeed implied genotype overlap. We finally used the derived phenotype similarities with genomic data to prioritize candidate genes and demonstrated advantages of this method over existing ones. Conclusions Our method is capable of not only capturing semantic relationships between words in biomedical records but also alleviating the dimensional disaster accompanying the traditional TF-IDF framework. With the approaching of precision medicine, there will be abundant electronic records of medicine and health awaiting for deep analysis, and we expect to see a wide spectrum of applications borrowing the idea of our method in the near future.


Background
Deciphering genetic basis of human inherited diseases is a fundamental task in human and medical genetics [1]. Typically, this task is done by applying linkage analysis to a pedigree or association study to a cohort to roughly locate genomic regions that are statistically associated with a disease of interest, and then experimentally verify functions of genes located in these regions [2,3]. In order to effectively determine target genes in functional experiments, computational methods are often used to prioritize candidate genes based on the "guilt-by-association" principle [4] by making use of multiple genomic data sources, including gene expression [5], protein sequences [6], protein-protein interaction [7], gene ontology [8,9], and many others [10][11][12][13][14]. The basic assumption in a guilt-by-association analysis is that genes associated with a disease share common functions, and thus exhibit common characteristics across a variety of genomic data sources. As such, one can infer the functional similarity of a candidate gene to a set of seed genes that are already known as associated with the disease under investigation and then use the resulting score to rank candidate genes. However, the essential prerequisite of known seed genes greatly restricts the scope of application of these methods, making them more suitable to study diseases whose genetic basis is partly known a priori.
To overcome this limitation, there have been quite a few studies make use of the human disease phenome, particularly, relationships between all human disease phenotypes [15]. With a reasonable extension of the "guilt-by-association" principle to assume that genes causative for phenotypically related diseases often share common functions, computational methods can "borrow" known disease genes from highly correlated disease phenotypes, hence enabling the prioritization analysis for diseases whose genetic basis can be completely unknown [16]. With the incorporation of more sophisticated statistical models or machine learning methods, the utilization of the human phenome has evidenced a wide range of applications in deciphering genetic basis of human diseases [17][18][19][20][21].
Most existing methods for inferring the human phenome start from the Online Mendelian Inheritance in Man (OMIM) database [22]. In a typical pipeline, one first extracts biomedical concepts (terms) from a OMIM record based on a standardized vocabulary such as the Unified Medical Language System (UMLS) [23], Medical Subject Headings (MeSH) [24], and Human Phenotype Ontology (HPO) [25]. Then, the frequency of occurrence of a term in a record is counted, yielding a statistic called the term frequency (TF) that indicates how important the term is in the record. Meanwhile, the negative logarithm of the occurrence frequency of a term in all records is measured, resulting in another statistic called the inverse document frequency (IDF) that represents whether the term provides concrete meaning. The product of these two statistics is often referred to as TF-IDF, which has been frequently used as a weighting factor in information retrieval and text mining [26]. Finally, with the TF-IDF of every term collected, one represents a record as a vector of TF-IDFs. The cosine of the angle between two vectors can then be calculated to measure the similarity between two disease phenotypes.
In natural language processing, the above pipeline belongs to a category of methods called "bag-of-words" (BOW), which can be traced back to 1950's and is now known to have several obvious disadvantages [27]. For example, words are treated independently in the calculation of TF-IDF values, and thus the semantic relationships between words are completely ignored. As an extreme example, any permutation of words in a record yields an identical TF-IDF vector as that obtained from the original record. This certainly goes against the objective of text mining. Moreover, the number of concepts is normally large, and thus the dimension of a TF-IDF vector is commonly high. In such a high dimensional space, the characterization of the similarity between two vectors is itself a difficult problem. In the case that the vectors are used in a machine learning task, the dimensional disaster seems inevitable.
Recent advances in the computer science community have evidenced several efforts to overcome the limitations of the "bag-of-words" methods. For example, Mikolov et al. introduced the skip-gram model [28], which represented words as vectors in a low-dimensional space and enabled precise prediction of the context surrounding a word. This method is ultra-efficient in that a single computer can easily train more than 100 billion words in a single day. Furthermore, sophisticated techniques such as subsampling of frequent words and the negative sampling have also been incorporated into the skip-gram model, speeding up this method by one order of magnitude with improved accuracy. Le and Mikolov further proposed a framework called paragraph vector that extends the vector representation from a single word to a sentence or even a document [29]. Besides the merit characteristic of converting unstructured text into a low dimension vector, these methods also have the ability to understand semantic relationships between words. For example, in the vector space, "Madrid" -"Spain" + "France" is closest to "Paris", meaning that the vector presentation precisely grasps the intrinsic semantic relationships of a country and her capital.
With the understanding of shortcomings of analyzing the human phenome based on the traditional methodology of TF-IDF and merit characteristics of the recent advances in vector presentation of text, we propose in this paper a method named mimvec to analyze the human phenome. Our method converts 24,061 OMIM records to low-dimensional numeric vectors (e.g., 100 dimensions) by customizing the paragraph vector methodology. We show that the resulting vector representation of OMIM records not only effectively enables the classification of diseases against genes records in the OMIM database, but also successes in the discrimination between diseases of different inheritance styles. We further calculate phenotype similarities between 7988 disease phenotypes and use this resource with multiple genomic data sources to prioritize candidate genes, yielding a novel method for fining disease genes that exhibits superior performance over existing ones. To facilitate applications of our method, we provide free downloads of pre-calculated vector presentations of 24,061 OMIM records at http://bioinfo.au.tsin ghua.edu.cn/jianglab/mimvec.

Overview of the proposed approach
The proposed method, mimvec, customized a deep learning method in natural language processing called Paragraph Vector [29] to analyse OMIM records and converted them to low-dimensional numeric vectors. As illustrated in Fig. 1, we first identified the MIM number of an OMIM record from the NO field, and we extracted from the TX field a sequence of words, in the order as they appeared in the record. In this procedure, we discarded all section captions, punctuations and numbers. Then, we represented both the MIM number and the words as low-dimensional (e.g., one hundred) numeric vectors. Next, we concatenated vectors corresponding to the MIM number and a small number of words in a sliding window (e.g., of size five) to form a new vector. Finally, we used this vector to predict the word appearing immediately after the window. In this model, an OMIM record was identified by its own MIM number, and thus was represented by a distinct vector. A word, on the contrary, was often shared by a number of OMIM records, and thus the corresponding vector was also shared across records. In this sense, the vector corresponding to an OMIM record provided information specific to the record and enabled more precise analysis of the relationship between the words within the record. The prediction task was modeled as a multiclass classification problem and solved by adopting a neural network, which took the concatenated vector as input and produced the probability of a word via a softmax function. In order to train such a model, we first initialized all vectors at random and then fed OMIM records to the model and maximized the average log probability of all predictions via a stochastic gradient descent algorithm with backpropagation (see Methods for details). When a model was well trained, we extracted the vector corresponding to a MIM number to obtain the vector presentation of the corresponding OMIM record.

Mimvec distinguishes phenotypes from genes
We asked the question of whether the vector presentation of an OMIM record could capture its intrinsic semantic characteristics. To answer this question, we converted all the 24,061 OMIM records to 100-dimensional vectors, and we explored the possibility of distinguishing the 7988 phenotype records from the rest 16,073 gene records. We first applied a principle component analysis (PCA) to the matrix (24,061×100) containing vectors of all the records and visualized the results in a two-dimensional Euclidean space composed of the first two principle components (PC). As demonstrated in Fig. 2a, we find that dots corresponding to diseases can be well distinguished from those corresponding to genes, except for a small number of outliers. Moreover, if we project the dots to the first principle component (x-axis), diseases obviously have smaller coordinates than genes. To show this observation in a clearer way, we plotted a heatmap by using the first 10 principle components (Fig. 2b). It is clear that the first principle component alone can well distinguish diseases from genes (accuracy = 91.45%), suggesting that most information contributing to the characteristics of these two different categories of records is already captured by this component.
We then performed a binary classification of disease records against gene ones by using all elements in the 100-dimensional vectors as features. We evaluated three classifiers, logistic regression (LR), random forest (RF) and support vector machine (SVM), using 10-fold crossvalidation and show the results in Table 1. From the table, we clearly see that all the three classifiers can well discriminate between these two catalogs. For example, the area under the receiver operating characteristic curve (AUC) for logistic regression is as high as 99.41%, Fig. 1 Overview of the proposed Mimvec method. Both the MIM number and the words in a sliding window are represented as low-dimensional numeric vectors, which are further concatenated to form features and used with a softmax function to predict the word followed. In the training phase, vectors are initialized at random, and OMIM records are fed sequentially to the model. The average log probability of all predictions and maximized via a stochastic gradient descent algorithm with backpropagation. When a model was well trained, vectors corresponding to MIM numbers are extracted and the classification accuracy (ACC) achieves 96.17%. Considering the situation of imbalance between the number of diseases (7988) and that of genes (16,073), we further calculated a criterion called balanced error rate (BER, the average of error rates for the two classes) and obtain a value of only 2.6%, again revealing the effectiveness of classifying diseases against genes using elements in the vector presentation as features.
We further tried different numbers of dimensions in the vector presentation and found the difference in the results is negligible for logistic regression and support vector machine. The performance of random forest, however, tends to drop with the increase of the number of dimension, especially for BER, suggesting that low dimensional vectors are preferred.
All these results support the conclusion that the vector presentation of OMIM records can indeed capture intrinsic characteristics of the records. Furthermore, the number of dimensions of the presentation is not critical for the capture of information implicated in the records.

Mimvec distinguishes diseases of different inheritance styles
We further asked the question of whether the vector presentation of phenotypes could distinguish diseases of different inheritance styles. To answer this question, we identified 1853 autosomal dominant diseases (MIM: 1xxxxx) and 1547 autosomal recessive diseases (MIM: 2xxxxx) from the OMIM database, and again applied the three classifiers (LR, RF and SVM) to classify autosomal dominant diseases against autosomal recessive ones. Results of the leave-one-out cross-validation experiments, as shown in Table 2, give us a positive answer to this question. For example, when presenting a phenotype as a 100-dimensional vector, the AUC, ACC and BER for logistic regression are 89.85, 83.21 and 17.05%, respectively. The other two methods also achieve reasonably high performance (RF: AUC = 87.29%, ACC = 79.32% and BER = 21.59%; SVM: AUC = 89.80%, ACC = 83.18% and BER = 17.10%). These results reveal the effectiveness of the vector presentation in the classification of diseases of different inheritance styles. Furthermore, we tried different numbers of dimensions in the vector presentation and found the performance of LR and SVM tends to improve with the increase of the number of dimensions, while that of RF tends to drop, though the change is itself small ( Table 2).
We then identified 48 immune diseases and 263 neurological disorders according to the Genetic Association database [30], and we also applied the three classifiers (LR, RF and SVM) to discriminate these two catalogs of diseases. Results of the leave-one-out cross-validation experiments, as shown in Table 3, demonstrate the possibility of solving this binary classification problem using the vector presentation of disease phenotypes as features. For example, with 100-dimensional vectors, the AUC, ACC and BER for SVM are 85.62, 86.13 and 29.14%, Fig. 2 Mimvec distinguished phenotypes from genes. a Projection of OMIM records to a two-dimensional Euclidean space using the first two principle components shows that diseases can be well distinguished from genes. b The first component alone can already discriminate diseases from genes with high accuracy respectively. The other two methods also achieve reasonably high AUC and ACC (LR: AUC = 80.63%, ACC = 81.94%; RF: AUC = 78.20%, ACC = 85.16%). However, we notice that BER for RF is only 44.57%, suggesting that this classifier tends to assign wrong label to one of the catalogs (i.e., immune diseases). We guess the phenomenon is due to the imbalance of the training samples (48 versus 263). We further tried vectors of different dimensions and found the performance of LR and RF tends to drop with the increase of the number of dimensions. For SVM, the classification performance is quite stable for different number of dimensions (Table 3).

Mimvec links phenotypes to causative genes
A fundamental problem in human genetics is to link phenotypes to genotypes. In medical genetics, it is of great importance to identify genes responsible for a disease phenotype. This is typically done by applying linkage analysis or association studies to identify genomic regions that show strong association with a specific disease of interest and then prioritizing candidate genes located in these regions by making use of such genomic data sources as gene expression [5] and protein-protein interaction [7]. With disease phenotypes and genes represented as vectors by our unsupervised deep learning approach, we explore the possibility of linking a disease phenotype to its causative genes by using their vectors alone.
From the OMIM database, we identified 4397 associations between 3798 diseases and 2944 genes. We then quantified the similarity between every pair of disease and gene by using the cosine measure, and we plotted distributions of the resulting similarities. As shown in Fig. 3a, the distribution of similarities between diseases and their known causative genes exhibit an obvious positive shift against that between diseases and genes selected at random, suggesting the possibility of utilizing this cosine measure as a score to distinguish true causative genes from irrelevant ones. With this understanding, we conducted a leave-one-out experiment to simulate the ambitious goal of identifying causative genes for a specific disease. In detail, for each of the 4397 known associations between a disease and a gene, we ranked the gene against other genes according to the cosine score (the larger the better), with genes known as associated with the disease excluded. As shown in Fig. 3b, 2379 (54.11%) known disease genes are ranked first, 422 (9.60%) ranked second, 197 (4.48%) ranked third. In contrast, with a random guess procedure, one could only expect to see 0.2736 (4397/16,073) known disease genes ranked first. In other words, prioritizing candidate genes according to the cosine similarity score yields a fold enrichment of more than 8695 (2379/0.2736), strongly suggesting the effectiveness of this method for finding disease genes.  We further repeated the above experiments with the use of three alternative similarity measures (Pearson's correlation coefficient, Spearman's correlation coefficient, and normalized Euclidean similarity). As shown in Fig. 3a, for all the three measures, the distribution of similarities between diseases and their known causative genes also exhibits an obvious positive shift against that between diseases and genes selected at random. As shown in Fig. 3b, all the three similarity scores are also capable of distinguishing causative genes from irrelevant ones. Moreover, we notice that the cosine and Pearson's correlation coefficient measures yield similar performance, which is obviously higher than that of Spearman's correlation coefficient and normalized Euclidean similarity.

Mimvec bridges phenotype similarity and gene functional overlap
Most computational methods for finding disease genes obey the guilt-by-association principle, which assumes that genes associated with a disease would have similar functions [4]. A more general assumption is that genes associated with phenotypically similar diseases exhibited functional similarities across different genomic data sources [19]. This assumption has become the basis for designing gene prioritization methods, with such successful stories as PRINCE [21], pgFusion [18], pgWalk [17] and many others [15]. With our method to characterize phenotype similarity based on vector representations of phenotypes with the use of the cosine measure or the alike, we explore whether the resulting phenotype similarity between diseases also implies genotype overlap.
We identified 4593 associations between 3921 diseases and 3023 genes using the tool BioMart [31]. For every pair of diseases, we measured their phenotype similarity based on the vector presentation (100 dimension) using the cosine score, and we measured their genotype overlap as the average pairwise similarity scores of their associated genes under a genomic data source. Particularly, we adopt four types of genomic data (RNA-seq, protein sequence, protein-protein interaction, and gene ontology, as described in Methods). We then partitioned the phenotype similarity scores into 10 bins of equal size, identified disease pairs belonging to each bin, and calculated the average genotype similarity of disease pairs in each bin, as shown in Fig. 4. From this figure, we observe strong correlations between the phenotype similarity and the genotype overlap. Taking gene expression derived from RNA-seq data as an example, it is shown that the genotype similarity increases with the increase of the phenotype similarity (Fig. 4a). Particularly, for disease pairs with very weak phenotype similarity (< 0.3), their genotype similarity is also very weak (near zero, not shown). For disease pairs with strong phenotype similarity (0.9~1.0), their genotype similarity is also strong (0.4630 on average). In the middle of the spectrum, for disease pairs with medium phenotype similarity (0.5~0.6), their genotype similarity is also at the medium level (0.1378 on average). We further regressed the mean genotype similarity of each bin against the corresponding mean phenotype similarity. For the other four genomic data sources, we observe similar patterns (Fig. 4 b-d). These results clearly suggest that genes associated with phenotypically similar diseases indeed exhibit functional similarities across different gnomic data sources. We further regressed the mean genotype similarity of each bin against the corresponding mean phenotype similarity. Results show that the coefficients of determination (r 2 ) are 0.9621 for the gene expression, 0.7573 for protein sequence, 0.7140 for protein-protein interaction, and 0.6860 for gene ontology. These results further suggest that the phenotype similarity derived from the vector presentation implies the genotype overlap.

Mimvec enables the prioritization of candidate genes
With the assumption that genes associated with phenotypically similar diseases exhibited functional similarities across different genomic data sources being validated, we further implemented a random walk with restart model (see Methods for details) to prioritize candidate genes to demonstrate how the vector presentation of phenotypes can be used to facilitate the identification of disease genes.
We performed three large-scale leave-one-out crossvalidation experiments to validate the effectiveness of this method using the 4593 associations between 3921 diseases and 3023 genes. We fist simulated the situation of a traditional linkage analysis or association study, in which the objective was to prioritize candidate genes in a linkage interval. In each validation run, we focused on one known disease-gene pair in an annotated association and looked at whether our method could correctly identify the gene from a set of control genes that were located within a 10 Mb region centred at the test gene (i.e., the gene associated with the disease), and ranked the test gene against the control genes using our method. In this procedure, we removed all annotated associations regarding the query disease to simulate the circumstance that the genetic basis of the query disease was completely unknown.
We derived three criteria to quantify the performance of our method. Dividing the number of test genes ranked first by the total number of candidate genes, we obtained a criterion called the top ranked test genes (TOP). Dividing the rank of a test gene by the total number of test and control genes in a validation run, we obtained the rank ratio of the test gene. Averaging rank ratios of all test genes, we obtained a criterion called the Mean Rank Ratio (MRR). At a certain threshold of the rank ratio, we defined the sensitivity and the specificity as the fraction of test and control genes ranked above and below the threshold, respectively. Varying the threshold, we plotted the rank operating characteristic (ROC) curve (sensitivity versus 1-specificity) and further calculated the area under this curve as a criterion called the AUC score.
As shown in Table 4 and Fig. 5, TOP, MRR and AUC for validation experiment against a linkage interval with When using the phenotype similarity derived from MeSH, the TOP, MRR and AUC are only 28.72, 20.97 and 78.35%, respectively. When using the phenotype similarity derived from HPO, the TOP, MRR and AUC are only 28.22, 20.95 and 79.34%, respectively. Since the genotype similarity data are the same in the above comparison, these comparisons suggest that the vector  presentation of phenotypes, though does not resort to any prior knowledge about biology and medicine, is superior to the TF-IDF counterparts that utilize UMLS, MeSH and HPO. We further repeated the above experiments by using gene similarity derived from protein sequence, protein-protein interaction data and gene ontology, and the results give us similar conclusion. Particularly, when fixing a phenotype similarity measure, performance of the four genomic data can be sorted as protein-protein interaction > gene ontology > protein sequence > gene expression. When fixing a genotype similarity measure, performance of the four phenotype similarity measure can be sorted as mimvec > UMLS > MeSH > HPO. The number of control genes in a linkage interval may have variation, thereby introducing biases in assessing the capability of a method in enriching test genes at top positions (e.g., ranking a test gene among top 10 against 20 control genes is much easier than ranking it among top 10 against 100 control genes). We therefore performed another validation experiment (i.e., nearest neighbor) by ranking each test gene against 99 control genes that were closest to the test gene in the same chromosome. From the results shown in Table 4, we draw the same conclusion as we have for linkage interval. Briefly, when fixing a genotype similarity measure, performance of the four phenotype similarity measure can be sorted as mimvec > UMLS > MeSH > HPO, no matter which evaluation criterion is used.
We further simulated the situation of exome sequencing studies, in which genetic variants are sequenced across the whole exome. In each validation run, we focused on one disease-gene pair in an annotated association and ranked the test gene against a set of 99 control genes that were selected at random from the entire genome. From the results shown in Table 4, we draw the same conclusion as we have for linkage interval and nearest neighbor. That is, when fixing a genotype similarity measure, the performance of the four phenotype similarity measure can be sorted as mimvec > UMLS > MeSH > HPO.
The above results are obtained by representing a phenotype as a vector of 100 dimensions. To study the possible influence of the number of dimensions to the performance of our prioritization method, we varied the vector size from 50 to 300 with step 50 and repeated the cross-validation experiments. As shown in Table 5, in general, the performance is quite robust for different number of dimensions, since the evaluation criteria do not vary much with the variation of the numbers of dimensions. In more detail, a relatively small number of dimensions (e.g., 50) can already give us reasonably good performance (higher than UMLS, MeSH and HPO). A relatively large number of dimensions does not show much help in improving the performance. When the number of dimensions is greater than 200, we even observe a drop of the performance. Considering that the number of parameters in the neural network increases linearly with the number of dimensions, and thus the computational burden also increases, a medium number of dimensions (e.g., 100) seems already a good choice.

Discussion
We conjecture that the success of our method can be attributed to the combination of the following aspects. First, our model considers local semantic relationships between words instead of treating words as independent units. From the viewpoint of natural language processing, relying on a standardized vocabulary (e.g., UMLS, MeSH and HPO) and the TF-IDF measure to analyze OMIM records belongs to a class of methods called "bag-of-words" (BOW), which ignores relationships between words and thus cannot capture semantic meaning of nearby words. In contrast, our method overcomes this limitation by predicting a word using its predecessors, and thus implicitly takes semantic relationships between words into consideration. Second, our model represents a record as a lowdimensional dense vector, while traditional methods based on TF-IDF describes a record as a high-dimensional sparse vector. However, a large number of dimensions usually leads to a disaster in machine learning, yielding such hard tasks as feature selection. Besides, the precise measure of the similarity between two vectors in a highdimensional space is itself a difficult problem, no mention the fact that the space is very parse. The main weakness of our vector representation method is that a dimension does not have the concreate meaning. In methods based on TF-IDF, a dimension corresponds to a term or concept in a standardized vocabulary, and thus it is convenient to explain the meaning of an element in a TF-IDF vector. Our method, however, embeds and compresses a record into a low dimensional vector, and hence the meaning of a dimension is not clear. With understanding, it might be worth pursuing to seek for a methodology similar to deconvolution to dissect the meaning of a vector. Another possible improvement of our method is to stand on the shoulders of already established fruitful biomedical knowledge. Although we have demonstrated the effectiveness of the vector representation without the use of any prior information about biomedical concepts, starting from scratch certainly wastes such knowledge that has been accumulated for a long period of time. In this sense, it might be worth pursuing to incorporate biomedical knowledge into our approach to further improve its effectiveness. Finally, the methodology of representing a document as a vector is not specific to the analysis of OMIM records. With the coming of precision medicine, there are abundant electronic records of medicine and health awaiting for deep analysis. We expect to see a wide range of applications borrowing the idea of our method in the near future.

Conclusions
In this paper, we have proposed a deep learning approach named mimvec to analyse the OMIM database, with particular emphasis on the human phenome. We have shown that the unsupervised conversion of OMIM records to low-dimensional vectors effectively enables the classification of diseases against genes, the discrimination between diseases of different inheritance styles, and the prioritization of candidate genes. When utilized with multiple genomic data sources, the similarity measure derived from vector presentation of phenotypes with no prior knowledge exhibits superior performance over traditional measures derived from ULMS, MeSH and HPO in a model for prioritizing candidate genes.

Data sources
We extracted 24,061 records from the OMIM database (accessed in April 2015) and identified 7988 disease phenotypes and 16,073 genes, represented by MIM numbers. We identified 20,327 genes from the Ensembl database (accessed in November 2015), represented by Ensembl gene ID. We extracted 4606 associations between 3933 diseases and 3028 genes by using the tool BioMart [31]. On average each disease was associated with 1.17 genes, and each gene was relevant to 1.52 diseases. We downloaded raw sequencing data of 503 RNA-seq experiments from the ENCODE projects and calculated expression levels (FPKM) of the 20,327 genes by using the standard Tophat and Cufflinks pipeline. We extracted sequences of 20,272 human proteins from the Swiss-Prot database (release 2014_01). We extracted 403,514 interactions between 13,747 human proteins from the STRING database (version 9.1). We extracted the biological process domain of the gene ontology and downloaded associated annotations for 15,602 human genes (both released on 2014- [11][12][13][14][15][16][17][18][19][20][21][22].

Vector presentation of phenotypes based on a standardized vocabulary
We adopted a text mining technique to characterize an OMIM record of human disease phenotype by using a standardized vocabulary. First, we splinted sentences in the TX and CS fields of a record into words, and we mapped these words onto UMLS concepts by using the program MetaMap [32] (Version 2016 V2), obtaining 8446 concepts for describing human disease phenotypes. For each of these concepts, we counted its occurrence frequency in the record to obtain a statistic called term frequency (TF), calculated the negative logarithm of its occurrence frequency in all OMIM records to obtain a statistic called document frequency (IDF), and derived a quantity called TF-IDF as the product of the TF and IDF values. Concatenating this quantity for all concepts where l r is the length of the r-th record, and T the total number of windows scanned. With a softmax function, the likelihood p(r, t, k) is calculated as where the summation is taken over all possible words, and y r tþk is the predicted vector derived from the record vector m r and word vectors w r t ; …; w r tþk−1 . The j-th dimension of y r tþk is calculated as with α, β and γ 's being parameters. Considering that the number of words is typically large (~10 5 ), a hierarchical softmax is often adopted for fast training [29]. In order to train the neural network model for the softmax classifier, stochastic gradient descent is often used, and the gradient is obtained by backpropagation. In our study, we default the window size to 5. Our empirical analysis also shows the robustness in the selection of this parameter.

Derivation of phenotype semantic similarity
Given the vector presentation of disease phenotypes, we characterized semantic similarity between two phenotypes by calculating the cosine of the angle between the corresponding vectors. We also adopted other similarity measures, including Pearson's correlation coefficient, Spearman's correlation coefficient, and Euclidean similarity, in the comparative study. Particularly, the Euclidean similarity was transformed from the standard Euclidean distance via a linear transformation to ensure the similarity was in the range of [0,1], while the other three similarity measures were already in such a range according to our numerical analysis. In a similar manner, we characterized the semantic similarity between a phenotype and a gene, and that between two genes.

Derivation of gene functional similarity
With the RNA-seq data of [33], we characterized a human gene using a 503-dimensional numeric vector that represented logarithm of FPKMs of the gene across the experiments. For a pair of genes indexed by i and j, we calculated the absolute value of the Pearson's correlation coefficient of the corresponding vectors to obtain their raw similarity scores r ij gexp ð Þ. We further applied an exponential transformation to convert the raw score into a functional similarity score, as where σ gexp ð Þ ij the standard deviation estimated from raw scores for all pairs of genes, and λ a tuning parameter with defaulting value 1.
We calculated pairwise local sequence alignments of human proteins using the Smith-Waterman algorithm implemented in SSEARCH [34]. We then constructed a sequence similarity network of these proteins by connecting two proteins with an undirected edge if their alignment e-value is less than a predefined threshold (10 −4 ).
Next, we calculated the shortest path distance (δ gexp ð Þ ij ) for every pair of proteins i and j in this network and converted it into a similarity value in the range of 0 and 1 by a linear transformation (r ). Finally, we applied the exponential transformation to convert a raw score to a functional similarity score.
We extracted interactions between proteins from the STRING database (Version 9.1) [35] and constructed a protein-protein interaction network accordingly. Then, as was done for protein sequences, we calculated the shortest path distance (δ gexp ð Þ ij ) for every pair of proteins i and j in this network and converted it into a value in the range of 0 and 1 ( r ). Finally, we applied the exponential transformation to convert a row score to a functional similarity score.
We identified 26,784 concepts from the biological process domain of the gene ontology [36] and characterized each human gene using a numeric vector of such number of dimensions. Here, each element in a vector was the information content of the corresponding concept. We calculated the raw similarity score between a pair of genes as the cosine of the angle between the corresponding vectors and applied the exponential transformation to convert a raw score into a functional similarity score.

Random walk for prioritizing candidate genes
Given a semantic similarity measure for phenotypes, we could construct a nearest neighbor network of diseases by keeping only 10 neighboring diseases of the highest similarity scores for each disease. Similarly, given a functional similarity measure for genes, we could also construct a nearest neighbor network of genes by keeping only 10 neighboring genes of the highest similarity scores for each gene. These two networks, together with known associations between diseases and genes, formed a heterogeneous work that included both diseases and genes and immediately enabled us to adopt the following random walk model for prioritizing candidate genes.
In detail, such a heterogeneous disease-gene network is denoted by a triple H = (D, G, A), where D = (d ij ) m × m is the weight matrix of the disease subnetwork, G = (g ij ) n × n that of the gene subnetwork, A = (a ij ) m × n the adjacency matrix of the interconnections, and m and n the numbers of diseases and genes, respectively. Applying row-normalization to D, we obtain a transition matrix U = (u ij ) m × m , where u ij ¼ d ij = P m j¼1 d ij denotes the probability that a random walker moves from the i-th disease to the j-th disease when it stays in the disease subnetwork. Similarly, we obtain three other transition matrices: V = (v ij ) n × n with v ij ¼ g ij = P n j¼1 g ij denoting the probability that the walker moves from the i-th gene to the j-th gene when it stays in the gene subnetwork, R = (r ij ) m × n with r ij ¼ a ij = P n j¼1 a ij r ij ¼ 0 if P n j¼1 a ij ¼ 0 being the probability that the walker jumps from the i-th disease to the j-th gene when it stays in the disease subnetwork, and S = (s ij ) n × m with s ij ¼ a ji = P m j¼1 a ji s ij ¼ 0 if P m j¼1 a ij ¼ 0 being the probability that the walker jumps from the i-th gene to the j-th disease when it stays in the gene subnetwork. We then define matrix T as and perform row-normalization to obtain the transition matrix for the heterogeneous network as W = (w ij ) (m + n) × (m + n) , where w ij ¼ t ij = P mþn j¼1 t ij and τ the probability of jumping from the disease subnetwork to the gene subnetwork or vice versa.
be initial probabilities for the disease and the gene subnetworks, respectively. We obtain u (0) by assigning probabilities proportional to disease similarities to neighbors of the query disease and 0 otherwise, and we set v (0) to zeros to simulate the situation that genetic basis for the query disease is completely unknown. Let p (0) = ((u (0) ) T , (v (0) ) T ) T contains initial probabilities for the heterogeneous network and p (t) contains probabilities that the walker stays at each node at time t, we have the iterative formula p tþ1 ð Þ ¼ 1−π ð ÞW T p t ð Þ þπp 0 ð Þ : Solving this linear equation when time tends to infinite, i.e., p (∞) = (1 − π)W T p (∞) + πp (0) with respect to the steady-state probability p (∞) , we obtain the steady state solution p (∞) = π(I − (1 − π)W T ) −1 p (0) , which can be decomposed into a disease part u The later one, v (∞) , can then be used to score the strength of association between a query disease and candidate genes. It has been show that the random walk model is not sensitive to the parameters involved in the model [17]. We therefore set default values for the parameters as τ = 0.5, π = 0.7 and ε = 10 −4 .