Predicting candidate genes from phenotypes, functions and anatomical site of expression

Abstract Motivation Over the past years, many computational methods have been developed to incorporate information about phenotypes for disease–gene prioritization task. These methods generally compute the similarity between a patient’s phenotypes and a database of gene-phenotype to find the most phenotypically similar match. The main limitation in these methods is their reliance on knowledge about phenotypes associated with particular genes, which is not complete in humans as well as in many model organisms, such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine-learning models. Results We developed a novel graph-based machine-learning method for biomedical ontologies, which is able to exploit axioms in ontologies and other graph-structured data. Using our machine-learning method, we embed genes based on their associated phenotypes, functions of the gene products and anatomical location of gene expression. We then develop a machine-learning model to predict gene–disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes, which are associated with phenotypes, functions or site of expression. Availability and implementation Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Understanding the molecular mechanisms underlying a set of abnormal phenotypes is important for diagnosis, prevention and development of therapies. Methods to identify and study these mechanisms include observational, experimental and computational approaches. In particular, in rare diseases, deciphering the mechanisms underlying a set of phenotypes is often limited due to small sample sizes. Computational methods that can reveal or model mechanisms in these diseases often rely on biological background knowledge.
Several computational methods have been developed to prioritize candidate genes for a particular disease or set of abnormal phenotypes (Feng, 2017;Guala and Sonnhammer, 2017;Tomar et al., 2019;Tranchevent et al., 2016;Zhang et al., 2018). Many such methods rely on identifying similarities between genes and suggest new candidates based on such a similarity (Gillis and Pavlidis, 2012). This similarity can be computed on several known features about a gene, including phenotype associations (Greene et al., 2016), distance within an interaction network (Peng et al., 2018) or functional similarity (Liu et al., 2018;Schlicker and Albrecht, 2010).
Phenotype-based methods have been particularly successful in finding candidate genes causing Mendelian diseases (Hoehndorf et al., 2011;Shefchek et al., 2020;Washington et al., 2009). Phenotype-based methods compare disease phenotypes to known genotype-phenotype associations and suggest candidate genes based on phenotype similarity measures (Kö hler et al., 2009). While these methods are successful, their main limitation is the incomplete knowledge of phenotypes that are associated with particular genotypes. One approach to overcome this limitation is the use of phenotype associations from model organism experiments together with ontologies that integrate phenotypes across different species (Bone et al., 2016;Hoehndorf et al., 2011;Smedley et al., 2013;Wang et al., 2017a;Washington et al., 2009). Although the use of model organisms expanded the scope of prioritizing candidate genes, there is only a limited amount of information about phenotype associations available for genotypes in model organism; furthermore, genes for which there are no orthologs in model organisms cannot benefit from cross-species phenotype-based gene prioritization approaches.
One possibility to overcome the limited information on genotype-phenotype associations is the use of prediction models that predict phenotypes, and efforts, such as the Computational Assessment of Function Annotation (Zhou et al., 2019) challenge regularly evaluate function and phenotype prediction models; while function prediction methods have increased significantly in performance and provide accurate predictions at least for some types of functions (Zhou et al., 2019), phenotype predictions still perform worse than function prediction methods (Jiang et al., 2016). Phenotypes arise from a genotype and interactions with the environment (Johannsen, 1911), and predicting the endophenotypes resulting from molecular aberrations requires the use of knowledge about molecular interactions as well as physiological interactions within and between cells, tissues and organs.
Logical axioms, as used in many phenotype ontologies to formally characterize and standardize phenotype descriptions (Gkoutos et al., 2018;Kö hler et al., 2019;Mungall et al., 2010), relate phenotypes systematically to biological functions and anatomical locations, and thereby integrate physiology, anatomy and abnormal phenotypes within a unifying formal framework (Gkoutos et al., 2018;Mungall et al., 2010;Shefchek et al., 2020). The axioms in phenotype ontologies rely on ontologies that can be applied across different species. In particular, biological processes, functions and cellular anatomy are described using the Gene Ontology (GO) (Ashburner et al., 2000), and anatomical sites are described using the UBERON anatomy ontology (Mungall et al., 2012); both ontologies are designed to integrate information across different species, and multiple large databases contain information that relate biological entities with classes in these ontologies. Phenotype ontologies therefore not only integrate background knowledge but can also be used to integrate data associated with different ontologies; in particular, they can be used to integrate functions of gene products, anatomical site or tissue of gene expression, and phenotypes resulting from a gene's loss of function.
Using ontologies and the background knowledge they contain in machine-learning models can significantly improve their performance (Smaili et al., 2020). Here, we developed an ontology-based machine-learning method to prioritize candidate genes based on abnormal phenotypes observed in mouse models, the normal functions of gene products and anatomical location of gene expression. Our method combines axioms in ontologies and annotations to ontology classes. We evaluate several machine-learning methods that utilize ontology axioms, and develop a novel graph-based method that overcomes several limitations of existing methods, in particular, when applying machine learning to different ontologies in which classes are not related mainly through subclass axioms but rather through other types of axioms. We demonstrate that our approach improves significantly compared to established phenotype-based gene prioritization methods, and further extends the application of these methods to all genes for which either their functions or their anatomical location of expression is known.

Ontology and annotation resources
We downloaded GO (Ashburner et al., 2000) annotations of 18 495 human gene products (495 719 annotations in total) from the GO website on March 20, 2020. We excluded the GO annotations where the evidence code indicated that the annotation was inferred from electronic annotation or for which no biological data are available (ND).
We obtained phenotype annotations for 13 529 mouse genes, including 228 214 associations between genes and MP (Smith and Eppig, 2009) classes, from the file MGI_GenePheno.rpt available at the MGI database (Smith et al., 2018). Phenotype associations were downloaded on March 20, 2020. We map each mouse gene to their human ortholog using the file HMD_HumanPhenotype.rpt available at the MGI database, resulting in 10,951 human genes where the mouse ortholog has phenotype associations.
We further downloaded the Tissue Expression Profiles (GTEx) dataset (Ardlie et al., 2015) from the Gene Expression Atlas (Papatheodorou et al., 2020) on March 20, 2020. GTEx characterizes gene expression across 53 tissues. We map the Ensembl protein identifiers to Entrez gene identifiers using the mapping provided by the Entrez database (Maglott et al., 2011). We set a threshold for whether a gene is expressed or not in a tissue by setting a cutoff of 4.0 transcripts per million; this threshold is determined experimentally (see Supplementary Fig. S4). Finally, we obtained 20 538 Entrez genes which have expression above this threshold in one or more tissue. We map each tissue to the Uberon Anatomy Ontology (Mungall et al., 2012), downloaded from the AberOWL ontology repository on March 20, 2020. We exclude the expression in EBVtransformed lymphocyte and transformed skin fibroblast, since these two tissues are not available in the UBERON ontology.
The PhenomeNET Ontology (Rodríguez-García et al., 2017) is a cross-species ontology, which integrates multiple species-specific phenotype ontologies as well as related ontologies, such as the GO and the Uberon Anatomy Ontology. We downloaded the PhenomeNET ontology from the AberOWL ontology repository on March 20, 2020.

Evaluation datasets
We obtain associations between 2542 human diseases and 2885 genes from the file MGI_DO.rpt available at the MGI database, downloaded on March 20, 2020; the dataset contains 4051 genedisease associations in total, where diseases are represented using their OMIM identifier (Amberger et al., 2011).
As our gene-phenotype associations are to mouse genes (resulting from a loss of function of that gene) while our evaluation set uses human gene identifiers, we need to identify human orthologs of the mouse genes. We identify the mouse orthologs of human genes, and human orthologs of mouse genes, using the file HMD_HumanPhenotype.rpt at the MGI database, downloaded on March 20, 2020. Supplementary Table S1 summarizes our training and evaluation data. For each type of feature, there is a different number of associated genes, and consequently a different number of gene-disease associations we can identify; most disease-associated genes have features in all three datasets.
We use functional interactions between proteins obtained from the STRING database (Szklarczyk et al., 2019) on March 1, 2020. The interaction dataset contains 19 355 proteins and 11 759 455 edges between them. We mapped the proteins to the UniProt database and filter out those entries that did not map to the UniProt database. Further, STRING provides a confidence score for an interaction and we only keep interactions with a confidence of at least 700. The remaining interaction network consists of 17 178 proteins with 840 672 interactions.

Embedding methods
Onto2Vec (Smaili et al., 2018) is a method to learn the semantic embedding representations of biological entities and by extracting features from ontology-based annotations, axioms and ontology structures. It directly utilizes the axiom features and also indirectly infers new logical axiom features by applying the HermiT OWL reasoner (Motik et al., 2009). Onto2Vec collects data and axioms as 'sentences' and uses a skip-gram model to learn the vector representation for each word. OPA2Vec (Smaili et al., 2019) is an extension of Onto2Vec, which includes the annotation axioms in ontologies and uses transfer learning to assign them a semantics.
A random walk of length k on a graph G ¼ ðV; EÞ is a sequence of nodes and edges n 1 ; e 1 ; . . . ; e kÀ1 ; n k such that for all i, 1 i < k; e i ðn i ; n iþ1 Þ 2 E and n iþ1 was chosen randomly from all neighbors of n i . Here, we also include edge labels in the walk. For the purpose of selecting neighboring nodes, we treat the graph as undirected. We generate 80 walks from each node, and stop the walks after 20 steps. When including functional interactions between proteins in our knowledge base, we increased the number of walks to 200 and stop the walks after 30 steps to account for the higher node degree. We implement our embedding algorithm in a software called DL2Vec and make the source code, together with our experiments, freely available under the GNU General Public License version 3.

Word2Vec model
Word2Vec (Mikolov et al., 2013) is a language model for learning vector representations of words based on co-occurrence within a context window. We use the skip-gram model of Word2Vec (Mikolov et al., 2013). Given a sentence with N words, the skipgram model reads the sentence with a window kernel size c and maximizes the co-occurrence probability of words that appear in the same window.
We apply the SkipGram algorithm on our node and edge sequence corpus, which is generated by a random walk on the heterogeneous graph. We set the skip-gram parameters to a window size of 10, and min_count value to 1. The training process iterates 20 times, and it outputs a 200 dimensional embedding for each entity.

Pointwise learning-to-rank prediction model (,)
We use a pointwise learning-to-rank model to predict associations between genes and diseases. The model takes two vectors V 1 and V 2 as input for two independent neural networks 1 and 2 . We then calculate the inner product of 1 ðV 1 Þ and 2 ðV 2 Þ and use a sigmoid function to obtain a similarity score between V 1 and V 2 . We train this model using binary cross-entropy as loss function. Each neural network 1 and 2 consists of two hidden layers, the first with 256 neurons and the second with 50. We use 20% dropout (Srivastava et al., 2014) after each layer, followed by a LeakyReLU (Maas et al., 2013) as the activation function. The model parameters are optimized using the Adam (Kingma and Ba, 2014) optimizer.

Evaluation
We use the ROC curve (Fawcett, 2006) to assess the performance of our classification model. The ROC curve is a plot of the true positive rate as a function of the false positive rate. To compute true positive and false positive rate, we rank all genes for each disease, and compute the average true and false positive rates at each rank. We then generate the ROC curve, and compute the ROCAUC, as the averages across all diseases. We also report the recall at rank n (Hits@n).
We compute differences in the ROCAUC using the nonparametric Mann-Whitney U test (Nachar, 2008). For the test, we test the significance of ranking true positive associations differently between two prediction models. We consider differences as significant if P < 0.05. In order to compare the performance of the embeddings generated from phenotypes (MP), gene expression (UBERON) and biological functions (GO) directly, we focus on genes which have annotations to all three ontologies as evaluation set; the number of genes that have annotations in all three ontologies is 9886.

Phenotype-based prioritization of candidate genes
We developed a method based on deep learning-to-rank candidate causative genes given a set of abnormal phenotypes that characterize a genetically based disease. We prioritize, or rank, genes based on three distinct types of features that can be associated with a gene: phenotypes associated with the gene's orthologs in the mouse; the functions and cellular locations of the gene products for which the gene encodes; and the anatomical locations at which the gene is expressed. Each of these features is expressed using biomedical ontologies and we use the ontology as part of the learning problem.
For this purpose, we first embed the information about genes and diseases together with the ontologies used to characterize them in a vector space and then use a supervised machine-learning model to predict whether a gene is causative of a set of phenotypes or disease.
Specifically, we obtain the annotations of human genes with functions and cellular locations encoded by the GO (Ashburner et al., 2000) from the GO Annotation database (Huntley et al., 2015), their anatomical site of expression in functional genomics experiments (Ardlie et al., 2015) encoded using the UBERON anatomy ontology (Mungall et al., 2012), and the phenotypes of their mouse orthologs from the Mouse Genome Informatics (MGI) database (Smith et al., 2018) and characterized using the Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009). Furthermore, we obtain phenotype annotations of human diseases with the Human Phenotype Ontology (HPO) (Robinson et al., 2008)  We 'embed' each gene and disease, their ontology-based annotations, and the ontologies used in the annotations, in a vector space. An embedding is a function from gene or disease identifiers, and from entities in ontologies, into a real-valued vector < n of size n (with n being a parameter of the embedding) such that some properties of the ontologies are preserved in < n . Initially, we use the Onto2Vec (Smaili et al., 2018), OPA2Vec (Smaili et al., 2019), methods to generate the embeddings as they have performed well in similar tasks before. We also use SmuDGE (Alshahrani and Hoehndorf, 2018), which generates feature vectors for entities represented in a knowledge graph and encodes for (parts of) the knowledge contained in ontologies. We generate embeddings individually using phenotype, GO and UBERON annotations; because these annotations are available for different numbers of genes, we also generate a set of embeddings based on the union of all genes and their annotations (i.e. for genes that have annotations from one, two or all three datasets) as well as another set of embeddings only for genes that have annotations from all three sources.
We use a pointwise learning-to-rank model (see Section 2 and Supplementary Fig. S1), to prioritize gene-disease pairs based on gene-disease associations in the Online Mendelian Inheritance in Men (OMIM) database (Amberger et al., 2011). Our model is based on neural networks; given a pair of embedding vectors G and D as input, the model independently transforms the embeddings into a lower-dimensional representations using two fully-connected hidden layers, and then computes the inner product followed by a sigmoid function that outputs a value between 0 and 1, and which we use as the prediction score for an association between G and D.
We train and test our model based on 10-fold cross-validation; in each fold we split our data by the disease (and not by the genedisease association pair) to ensure that the diseases on which we test have not been seen during training. Within each split, we use 10% of the data as the testing data used to report the final results of our model, and we use the other 90% data to train the model and tune its parameters; within these 90% of training data in each fold, we use a randomly chosen set of 90% for training and 10% for validation. We use sub-sampling of 'unknown' associations between genes and diseases to generate negative associations for each disease; we sample 20 negatives for each positive association. We then use binary cross-entropy as the loss function to optimize the ranking model and use the Adam optimizer (Kingma and Ba, 2014) to train our model.
For the evaluation of our learning-to-rank model, we rank all genes for each disease based on their prediction score (within the testing set). We then use the receiver operating characteristic (ROC) curve (Fawcett, 2006) and the area under the ROC curve (ROCAUC) to evaluate how the known positive gene-disease pairs rank among all the possible pairs. Supplementary Figure S2 shows the ROC curves for our prediction model when using different embedding methods, and Table 1 summarizes the results of the cross-validation. We find that we can identify causative genes best when using phenotypes, while the predictive performance decreases when using features derived from gene functions and anatomical site of gene expression.
We hypothesize that one of the reasons for the observed difference in predictive performance between the different data types is the inability of Onto2Vec and OPA2Vec to capture longer distance dependencies through which phenotypes, functions and anatomical locations are connected within the PhenomeNET ontology. In particular, Word2Vec is equivalent to factorize a matrix, which contains the pointwise mutual information (Church and Hanks, 1990) of words within a context window (Levy and Goldberg, 2014), and this measure is only based on directly co-occurring tokens (within the context window considered by Word2Vec). When using Onto2Vec or OPA2Vec, genes and diseases will only directly cooccur with the ontology classes used to characterize them (i.e. phenotypes, GO functions and UBERON anatomical classes for genes, and phenotypes for diseases), as well as all their superclasses (because Onto2Vec and OPA2Vec compute the transitive closure over the subclass hierarchy and add them to the set of asserted axioms). Consequently, even if a phenotype class is defined based on an anatomical location or a function, this anatomical location or function class will not co-occur with a gene or disease that is associated with this phenotype. For example, the class Ventricular septal defect (HP: 0001629) is defined as an incomplete closure of the Interventricular septum (UBERON: 0002094), which in turn is constrained to be a part of the Heart (UBERON: 0000948) in UBERON. When embedding genes based on their anatomical site of expression (i.e. using the UBERON ontology) and diseases based on their phenotypes, Onto2Vec and OPA2Vec will only add subclass relations as directly co-occurring tokens to use in the embedding but not the classes that are linked indirectly through axioms.
We hypothesize that incorporating these indirect associations will allow us to better utilize the background knowledge contained in the ontologies and further improve predictive performance, and we develop a novel embedding approach for ontologies that aims to improve the embedding of ontologies with many complex axioms, as well as embeddings of entities, which are annotated with classes that do not stand in a subclass relation but are related through more complex axioms.

Embedding graph-based representations of ontologies
Our novel embedding approach is inspired by the OWL2Vec (Holter et al., 2019) as well as the Walking RDF & OWL (Alshahrani et al., 2017) and SmuDGE (Alshahrani and Hoehndorf, 2018) methods, which first convert ontologies into a graph based on syntactic patterns within the ontology axioms, and then apply a knowledge graph embedding (Wang et al., 2017b) on the resulting graph. However, our method extends these approaches to incorporate more complex forms of axioms into the generated graph so that the complexity of the axioms in a cross-species phenotype ontology, such as PhenomeNET (Rodríguez-García et al., 2017) can be utilized.
We have defined a transformation function (shown in Table 2) that is used to convert ontology axioms in the Web Ontology Language (OWL) (Grau et al., 2008) format into a graph. The transformation function considers logical operators as well as quantifiers, and converts them into edges (or subject-predicate-object triples) of a graph. The function is applied to all logical axioms in an ontology, determines whether the precondition or preconditions of the Note: The intersection represents embeddings generated jointly from all three types of ontologies and associations, limited to genes that have associations to all three ontologies, while union represents embeddings generated jointly from all three types of ontologies and associations, limited to genes that have associations in at least one of the three ontologies. For ROCAUC, we report the intervals obtained from cross-validation. The results 'without PPI' use a graph based on ontology axioms and associations between genes/diseases and ontology classes; the results 'with PPI' also include functional interactions between genes/proteins as part of the graph. Best performing results are highlighted in bold. function are satisfied for the axiom, and if they are satisfied it adds one or more edges to the graph.
Our transformation considers axioms pertaining to classes (the ontology, or TBox). Associations between a gene G and an associated ontology class C can be modeled in OWL as an axiom G SubClassOf: has-function some C (or using some other relation instead of has-function) and, as a consequence, a direct edge between G and C will be created through our algorithm. We convert all axioms from the PhenomeNET ontology, and the annotations of gene and disease entities with their ontology classes, into a graph representation using the transformation function in Table 2. After generating the graph, we apply iterated random walks starting at nodes of the graph to generate a corpus, and use Word2Vec (Mikolov et al., 2013) to generate embeddings for nodes and edge labels based on this corpus.
We repeat our supervised training process using our novel embedding method. The results are summarized in Table 1 (lefthand side, 'without PPI') and ROC curves for this task shown in Supplementary Figure S2. While the performance in predicting gene-disease associations using only phenotype annotations is comparable to the predictive performance observed when using Onto2Vec and OPA2Vec, we observe a significant improvement in ROCAUC when using features encoded using GO (p ¼ 2:02 Â 10 À125 , Mann-Whitney U test) and UBERON (p ¼ 2:77 Â 10 À152 , Mann-Whitney U test), indicating that our approach can better capture relations between classes that are related through complex axioms instead of only subclass axioms. The SmuDGE and OWL2Vec methods are more similar to our approach and their performances are closer to our method; however, we still improve over both SmuDGE and OWL2Vec when using only GO and UBERON as features. We also reports recall (hits) at ranks 1, 10 and 100.

Adding network information
Since our embedding approach is based on graphs and random walks, it can naturally accommodate other graph-structured information in addition to the graph generated from the ontology axioms. There are many biological networks that also relevant to understand gene-disease associations (Alanis-Lobato et al., 2016;Al-Harazi et al., 2016;van Dam et al., 2018), in particular interaction networks. To determine whether our method is able to utilize this information, we conduct another experiment in which we add functional interactions between proteins obtained from the STRING database (Szklarczyk et al., 2019) to the knowledge base. We add the interactions to our graph as direct interacts-with edges between genes (or, equivalently, for an interaction between proteins P 1 and P 2 , we add the axiom P 1 v9interacts À with:P 2 to the knowledge base and convert them according to the conversion rules in Table 2), and then we repeat our workflow and predict associations between genes and diseases based on the new embeddings (which now also contain information about interactions between genes/proteins in addition to the associations with ontology classes as in the previous experiment). The results of this experiment are shown in Table 1 (right-hand side, 'with PPI') and Supplementary Figure S3, which shows the overall performance obtained from our method using network information and its comparison with embeddings based on other methods.
We find that our workflow results in the best-performing model with regard to ROCAUC, in particular, when comparing the embeddings generated using ontologies of different domains, such as when comparing diseases (characterized with phenotypes) and genes characterized by their function or anatomical site of expression.

Discussion
We designed a novel method to prioritize candidate genes given a set of abnormal phenotypes associated with a genetically based disease; our method uses information about genes obtained from animal model phenotypes, the functions of gene products, the anatomical location of gene expression and interaction networks, as well as a large amount of background knowledge contained in biomedical ontologies. Our method improves over other phenotype-based methods in several ways.
First, we use a pointwise learning-to-rank machine-learning model, which improves the predictive performance when evaluated using gene-disease associations from the OMIM (Amberger et al., 2011) database; our model is designed to directly learn the similarities between two embeddings and results in improved predictive performance when compared to other models (Smaili et al., 2018(Smaili et al., , 2019 used to predict gene-disease associations based on embeddings. Second, we developed a novel method to exploit complex axioms by converting them into a graph and relying on graph embeddings; we show that this approach improves performance significantly when embedding multiple ontologies that are only linked through complex axioms. This advance is particularly important in ontologies that are heavily formalized using OWL and that are interlinked, such as the ontologies in the collaborative OBO Foundry effort (Smith et al., 2007). For example, using DL2Vec, we are able to prioritize the association between a Mendelian form of cataract (OMIM: 604 307) and the gene CRYGC within the first two ranks when incorporating the GO, whereas OPA2Vec and Onto2Vec rank this gene below rank 1000. One of the key phenotypes of cataract is visual impairment (HP: 0000505), which is defined, in the HPO, as a decreased visual perception (GO: 0007601); based on this formal definition, DL2Vec creates an edge between visual impairment and visual perception. The gene CRYGC is associated with the GO class visual perception. When performing the iterated random walks from either the disease node or the gene node, we find that multiple walks use this edge and therefore lead to a direct co-occurrence of both the disease and the gene with the nodes representing visual impairment as well as visual perception; applying Word2Vec on these walks results in the gene embedding and disease embedding to become more similar to each other and allows DL2Vec to prioritize the association at one of the top ranks.
Third, our method prioritizes candidate genes for a set of abnormal phenotypes using a combination of gene expression, function, network, phenotype data and ontologies. In contrast to methods that rely on knowledge about disease-associated genes in order to prioritize new candidates, the input to our method are only the Note: Q represents an arbitrary quantifier or cardinality restriction, A and B i represent arbitrary class names and R i represents arbitrary relation names. Our algorithm iterates through all axioms in O and determines whether the conditions are satisfied; if the condition or conditions are satisfied, the corresponding triple or triples are added to the resulting graph. For example, 0 feedingbehavior 0 vbehavior will result in the triple h 0 feedingbehavior 0 ; SubClassOf; behaviori being added. The first rule captures more complex axioms where multiple relations could be used as part of the axiom and D is either a class name or a complex class description (union or intersection); in the latter case, multiple triples are added to the resulting graph. phenotypes observed in a patient. In our approach, prioritization of candidate genes does not rely on knowledge (or existence) of other genes associated with the same phenotypes. We achieve this by combining the different annotations on two distinct levels: first, the different annotations (phenotype, function, expression) are combined on the level of a gene or gene product (which we do not distinguish), so that a single entity (the gene and its products) is associated with all three types of information; second, we also utilize the links between ontologies directly. The links between the classes in ontologies allow us to establish new relations between the different features associated with genes, and these features are not accessible without utilizing the ontology axioms. This makes our approach applicable to Mendelian disease for which no genes may be known to be associated (or where only a single gene is associated), and where features of known disease-associated genes could not be used to identify novel causative genes. While approaches based on the guiltby-association principle generally perform well on diseases or phenotypes with several known associated genes (Chen et al., 2009;Gillis and Pavlidis, 2012;Schlicker and Albrecht, 2010;Singleton et al., 2014;Tranchevent et al., 2016), our method has a broader range of application.
Fourth, while there are several phenotype-based methods that are applied widely for prioritizing candidate genes (Cornish et al., 2018;Kö hler et al., 2009;Smedley et al., 2013), they are limited to genes with associated phenotypes. As there are only a limited number of human genes with associated phenotypes, this set of genes can be expanded significantly by incorporating phenotypes of human orthologs in animal models (Smedley et al., 2013); however, even using animal model phenotypes will leave about half of human genes without any phenotype associations, either due to lack of phenotype associations in animal models or due to the absence of orthologs for a human gene (Shefchek et al., 2020). We significantly expand phenotype-based gene prioritization methods to genes that have either phenotype associations, are associated with GO functions, or have known sites of expression. While the predictive performance of our method is lower for genes that do not have phenotype associations than for genes with associated phenotypes, we show that we can nevertheless identify disease-associated genes by comparing phenotypes to gene functions or to anatomical locations. Additionally, our model is extensible and can include additional features if they can be encoded using ontologies. For example, we can expand our model using gene expression in individual celltypes, using the Celltype Ontology (CL) (Bakken et al., 2017). We experimented using single-cell RNAseq data from the Tabula Muris project (The Tabula Muris Consortium et al., 2018) in which genes are annotated with the CL. From this dataset, we obtain 17 149 associations between genes and one or more classes from CL. We added the CL annotation of genes as well as the disease phenotype annotations and performed the same experiments as for the other three ontologies. Without including the functional interactions between genes, we obtain a ROCAUC of 0.906 (0:883 À 0:949) for predicting gene-disease associations (Hits@1, Hits@10 and Hits@100 are 0.037, 0.299 and 0.634, respectively). These results show that single-cell gene expression can provide more information for predicting gene-disease associations than tissue-level gene expression encoded using Uberon. One key limitation in using celltype-specific gene expression is that CL is used in fewer axioms within phenotype ontologies (compared to UBERON or GO), and therefore our method will not exploit relations between phenotypes and celltypes as well as relations between the other ontologies.
Our method still has several limitations. Our conversion from OWL into a graph does not consider all OWL axioms, and the conversion also treats different types of restrictions and axiom types identically although their semantics is different. In the future, we plan to extend the method to convert any OWL axioms into a graph representation, relying, e.g. on relational patterns defined in the OBO Relation Ontology (Smith et al., 2007), and also rely on inferred axioms for generating the graph, such as implemented in the Onto2Graph method (Rodríguez-García and . Another major limitation of our approach is that it is inherently transductive and not inductive. In particular, the diseases with their phenotype associations must be known in our workflow before generating embeddings and training our prediction model, and it is not straightforward to apply the approach to a new set of phenotypes (such as the phenotypes observed in an individual). This limitation is shared by many graph embedding and knowledge graph embedding approaches (Wang et al., 2017b). However, this limitation can be overcome either through the use of inductive methods for learning on knowledge graphs, such as graph neural networks (Kipf and Welling, 2016;Scarselli et al., 2008), or by including patients with their phenotypes as part of the original data (or graph), training the model on gene-disease associations and applying it to predict candidate genes for the patient nodes. However, extending our approach to an inductive setting will allow for an easier combination of our approach with methods to find pathogenic causative variants based on observed phenotypes and next-generation sequencing data (Boudellioua et al., 2017;Robinson et al., 2014).
Finally, we treat all genes that are not known to be associated with the disease as negatives and consequently have many more negative than positive associations. This has two consequences; first, we may incorrectly classify an association as negative when a gene is associated with the disease but this association is not yet known. Second, while the overall predictive performance of our method improves over the state-of-the-art and ROCAUC is usually above 0.9 in our evaluation, the recall at the first ranks is still low and rarely exceeds 5% at the first rank. The reason for this difference between the evaluation measures is the imbalanced dataset we use for evaluation, where all genes not known to be associated with a disease are considered negative for that disease. Our evaluation therefore does not consider any additional knowledge about potential associations between a gene and disease. However, in a realistic scenario in which new genes are evaluated for their association with a Mendelian disease, more information is usually available, either from evaluating the pathogenicity of variants found in affected individuals, filtering by pedigree and mode of inheritance, or filtering by variants found in unrelated individuals with the same phenotypes; after such a workflow, usually <100 genes remain as potential candidates (Alfares et al., 2020) (in contrast to 9886 in our evaluation), and recall at top ranks will improve.

Conclusions
We developed a method for prioritizing candidate genes given a set of phenotypes associated with a disease. Our method can utilize different types of features characterized through ontologies, and significantly improves the phenotype-based prediction of disease genes. While previous phenotype-based gene prioritization methods are only applicable when phenotype associations are known for genes, our method can be applied to a much larger number of genes for which either functions, sites of expression, phenotypes or interactions with other genes are known.