The Alzheimer's Knowledge Base - A knowledge graph for therapeutic discovery in Alzheimer's Disease research

Background: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer’s Disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture heterogeneous biomedical knowledge that is central to the disease’s etiology and response to drugs.
Objective: We designed the Alzheimer’s Knowledge Base (AlzKB) to alleviate this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics.
Methods: We designed AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (chemicals, genes, anatomy, diseases, etc.). AlzKB uses an OWL 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base.
Results: AlzKB is freely available at http://alzkb.ai, and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we use graph data science and machine learning to (a) propose new therapeutic targets based on similarities of AD to Parkinson Disease and (b) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones.
Conclusions: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through two use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge.



Introduction
Alzheimer's Disease (AD) is a progressive, neurodegenerative disease affecting an estimated 6.5 million Americans aged 65 and older, and represents a significant clinical, economic, and emotional burden worldwide. [1] AD is often cited as one of the greatest healthcare problems of the 21st century, particularly in developed nations with an increasing proportion of older adults. Despite its societal impact, effective pharmaceutical treatments for AD remain notoriously elusive. The US FDA has approved 5 drugs for the treatment of AD, 4 of which (donepezil, rivastigmine, galantamine, memantine) only temporarily treat symptoms but do not alter overall progression of the disease, [2] while the fifth (aducanumab) is highly controversial in terms of evidence for effectiveness and its safety profile. [3] AD researchers prioritize the discovery and approval of new therapies for the disease, both by discovering new compounds and by repurposing drugs that are already approved to treat other (non-AD) human diseases.

AD has been associated with substantial changes in pathology, including the presence of neuritic plaques associated with amyloid-beta protein, extracellular deposition of amyloid-beta, and neurofibrillary tangles mostly composed of tau protein. Previous research has shown that these neuropathological changes begin to occur years before clinical symptoms are apparent. [4,5] Despite decades of research, why this pathology begins to develop (that is, the pathogenesis of AD) is largely unknown. [6] Current consensus is that AD risk is multifactorial. The most well-established risk factors include age, family history, and certain genetic factors, especially the presence of the ε4 allele of APOE. However, the exact mechanism by which several of these factors, including the presence of APOE ε4, cause or contribute to AD risk is unknown. [7]

Among the many techniques employed in AD therapeutics research is a wealth of computer-aided approaches that leverage recent advances in bioinformatics, epidemiology, artificial intelligence, and machine learning. For example, Rodriguez et al. developed a machine learning framework that assesses gene lists constructed from differential gene expression data in response to drug treatment to determine whether those drugs would be candidates for repurposing in Alzheimer's disease. [8] Tsuji et al. used an autoencoder neural network to perform dimensionality reduction of a high-density protein interaction network in order to identify new possible drug targets and then found drugs associated with those targets. [9] Genome-wide association studies (GWAS) have long been used for the identification of genes that confer AD risk, particularly for rare genes or genes with small (but statistically significant) contributions to disease risk. [10]

In this study, we describe the design and deployment of a major new knowledge resource for computational AD research, named the Alzheimer's Knowledge Base (AlzKB), with a particular focus on drug discovery and drug repurposing. At its core, AlzKB consists of a large, heterogeneous graph database describing entities related to AD at multiple levels of biological organization, with rich semantic relationships describing how those entities are related to one another. We hypothesize that these relationships contain valuable knowledge that cannot be effectively captured in existing data resources, with the additional advantage of improving explainability of new predictions. To support this hypothesis, we also present two data-driven analyses involving machine learning on AlzKB's knowledge graph: (1) predicting Parkinson's Disease genes that may also be associated with AD, and (2) generating and explaining drug repurposing hypotheses for treating AD. Both analyses replicate existing knowledge while proposing entirely novel directions for future experimental validation. AlzKB is free, open source, and publicly available online. [11]

Methods
AlzKB is a large, heterogeneous, graph-formatted knowledge base, accompanied by a suite of tools for interacting with and making discoveries from it. The knowledge base contains entities related to AD etiology and treatments, as well as the semantically meaningful relationships that link those entities.

AlzKB Ontology
Graph databases are renowned for their flexibility in representing data that does not conform to a rigid, tabular structure, but this flexibility comes at the expense of the consistency and semantic standardization that a fixed schema would implicitly enforce. [12] To mitigate this issue, we designed an OWL 2 ontology that describes the types of entities relevant to AD and its treatment, as well as the types of relationships that link those entities, and that serves as a 'template' for nodes and edges in the knowledge graph. Briefly, since many of the components of a graph database have a one-to-one correspondence with components of an OWL 2 ontology (e.g., OWL 2 classes are equivalent to graph database node labels, OWL 2 object properties are equivalent to edge types in a graph database, etc.), it is possible to populate the ontology using biomedical knowledge and translate the contents of the populated ontology into an equivalent graph database. Therefore, enforcing consistency in the ontology becomes equivalent to enforcing consistency in the graph database.
We constructed the ontology manually using the Protégé ontology editor (v5.5.0), [13] following an iterative process guided by expert domain knowledge. First, we prototyped a class hierarchy containing the types of nodes (e.g., Gene, Disease, Pathway, Drug) desired in the knowledge base. We then annotated these classes with data properties (node properties) and object properties (relationship types). A thorough description of the components of OWL 2 ontologies is given by Hitzler et al. [14] Finally, we placed restrictions on the ontology to reflect biology and clinical practice. For example, there are restrictions stating that all pathways contain one or more genes, or that all drugs in the knowledge base must have a valid DrugBank ID. We repeated these steps several times, revising previous iterations until several domain experts agreed that the semantic contents of the ontology are consistent with current AD knowledge and the systems biology processes involved in AD etiology.
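To make the two example restrictions concrete, the following is a minimal sketch in plain Python of how ontology-style constraints might be checked against graph-shaped data. It is not the AlzKB implementation, which expresses these restrictions in OWL 2. The `Node` and `Edge` types, the `GENE_IN_PATHWAY` relationship name, the `drugbank_id` property name, and the assumed DrugBank ID pattern ("DB" followed by five digits) are all illustrative assumptions.

```python
import re
from dataclasses import dataclass, field

# Assumption: a valid DrugBank ID looks like "DB" followed by five digits.
DRUGBANK_ID = re.compile(r"^DB\d{5}$")


@dataclass
class Node:
    node_id: str
    label: str                                 # plays the role of an OWL 2 class
    props: dict = field(default_factory=dict)  # plays the role of data properties


@dataclass
class Edge:
    source: str
    rel_type: str                              # plays the role of an OWL 2 object property
    target: str


def check_restrictions(nodes, edges):
    """Return human-readable violations of the two example restrictions."""
    labels = {n.node_id: n.label for n in nodes}
    # Pathways that have at least one incoming gene-membership edge.
    # "GENE_IN_PATHWAY" is a hypothetical relationship name.
    pathways_with_gene = {
        e.target
        for e in edges
        if e.rel_type == "GENE_IN_PATHWAY" and labels.get(e.source) == "Gene"
    }
    violations = []
    for n in nodes:
        # Restriction 1: every pathway contains one or more genes.
        if n.label == "Pathway" and n.node_id not in pathways_with_gene:
            violations.append(f"Pathway {n.node_id} contains no genes")
        # Restriction 2: every drug has a valid DrugBank ID.
        if n.label == "Drug":
            drugbank_id = str(n.props.get("drugbank_id", ""))
            if not DRUGBANK_ID.match(drugbank_id):
                violations.append(f"Drug {n.node_id} lacks a valid DrugBank ID")
    return violations


# Toy data with placeholder identifiers (not drawn from AlzKB).
nodes = [
    Node("gene:1", "Gene", {"symbol": "APOE"}),
    Node("pathway:1", "Pathway"),
    Node("pathway:2", "Pathway"),
    Node("drug:1", "Drug", {"drugbank_id": "DB00843"}),
    Node("drug:2", "Drug", {"drugbank_id": "unknown"}),
]
edges = [Edge("gene:1", "GENE_IN_PATHWAY", "pathway:1")]

violations = check_restrictions(nodes, edges)
for message in violations:
    print(message)
```

In this toy data, the second pathway and the second drug each violate one restriction, which is the kind of inconsistency an ontology reasoner would flag before the data reach the graph database.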

Collecting and assembling third-party data sources
Using the class hierarchy of the AlzKB ontology as a starting point, we determined a set of the most important entity types to include in the first release of the knowledge base. For example, we prioritized inclusion of entities representing Diseases (specifically AD and its various subtypes), Genes, and Drugs, among others. Similarly, we identified important relationship types (e.g., "DRUG_BINDS_GENE" or "GENE_ASSOCIATED_WITH_DISEASE") to include in the knowledge base. For each of these entity types and relationship types, we identified a third-party, public data source that would serve as a collection of "ground truth knowledge" for that entity or relationship type. In the assembled knowledge base, there is roughly a 1-to-1 correspondence between a data record in the original data source and its corresponding entity/relationship in AlzKB.

Implementing AlzKB
After populating the AlzKB ontology with entities, relationships, and data properties drawn from third-party data sources, we serialized the ontology into the RDF/XML graph data format. A complete list of the data sources used in AlzKB at the time of writing is provided in Table 1. We then populated a Neo4j graph database (v4.4.5) [15] with the contents of the RDF/XML file using the neosemantics library. [16] Finally, we stripped the newly populated graph database of unnecessary artifacts that are components of the OWL 2 standard, leaving only nodes, relationships, and properties defined within the hierarchy. For the publicly hosted version of AlzKB, we created a web server that hosts both the static AlzKB website (containing information, documentation, and usage details) and the Neo4j graph database, which is available by navigating to a subdomain [17] of the main website. [11]

Knowledge base description
The first release of AlzKB (v1.0) [60] contains 118,902 distinct nodes and 1,309,527 relationships linking those nodes. Node types and relationship types, with their counts, are summarized in Table 2 and Supplemental Table S1, respectively. Users can interact with AlzKB in their web browser using the Neo4j interface, or programmatically by connecting to the graph database over the internet. We also provide instructions for installing a local copy of the graph database, as well as for building the database from its original data sources.

Proposing new therapeutic targets for AD
As a proof of concept, we performed an analysis to predict whether known Parkinson Disease (PD) genes are also linked to AD etiology. A growing body of work has established physiological similarities between PD and AD, and it has been proposed that drugs targeting PD genes could potentially treat AD as well. To do this, we defined a binary classification task to predict whether gene nodes in the AlzKB knowledge graph are or are not AD genes. [61]

We trained a random forest (RF) classifier using the following topological graph features, which are computed for every node pair in the graph (regardless of whether an edge does or does not exist between them): Common Neighbors (CN), Total Neighbors (TN), Preferential Attachment (PA), Adamic-Adar (AA), and Resource Allocation (RA). [62-65] Each feature gives a different measure of network 'relatedness' for a pair of nodes, and these measures are then used as features in the predictive model. For a given node pair $(n_1, n_2)$, these metrics are defined as follows (given here in their standard forms):

\[
\begin{aligned}
\mathrm{CN}(n_1, n_2) &= |N(n_1) \cap N(n_2)| \\
\mathrm{TN}(n_1, n_2) &= |N(n_1) \cup N(n_2)| \\
\mathrm{PA}(n_1, n_2) &= |N(n_1)| \cdot |N(n_2)| \\
\mathrm{AA}(n_1, n_2) &= \sum_{u \in N(n_1) \cap N(n_2)} \frac{1}{\log |N(u)|} \\
\mathrm{RA}(n_1, n_2) &= \sum_{u \in N(n_1) \cap N(n_2)} \frac{1}{|N(u)|}
\end{aligned}
\]

where $N(n_i)$ is the set of neighbor (adjacent) nodes of node $n_i$.

To assemble the dataset, we considered all gene nodes adjacent to AD as positive (n=101) and all gene nodes not adjacent to AD as negative (n=62,306). The negative samples are assumed to contain a mixture of true negatives and false negatives; in link prediction tasks the goal is to recover the false negatives. We further filtered the negative nodes to omit PD genes (n=73) and orphan gene nodes (n=43,032), and downsampled the remaining genes to 303 (i.e., 3 times the number of positive samples). Our training procedure for the random forest model includes 3-fold grid search cross validation to optimize hyperparameters, an 80%/20% train/test split, and repetition of the procedure 10 times with random sampling. To evaluate performance, we used accuracy, balanced accuracy, precision, recall, F1-score, AUROC, and AUCPR, as shown in Figure 2.

The RF model predicted gene-disease relationships with an average balanced accuracy of 96.2% (precision = 0.88, recall = 0.98). We applied the trained models to predict PD genes that are likely to also be AD genes. Among the 73 PD genes in AlzKB, 8 genes (FYN, DCTN1, SNCA, SYNJ1, RSP12, ATXN2, KCNIP3, and CHRNB1; described in Table 3) were predicted to be AD genes. Seven of the genes were predicted to be AD genes in all 10 models, while CHRNB1 was predicted in 7 of the 10 models. One of these genes, KCNIP3, encodes a voltage-gated potassium channel-interacting protein that is critical to neuronal excitability.
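The five topological features named in this section (CN, TN, PA, AA, RA) can be computed directly from neighbor sets. The sketch below uses only the Python standard library and the standard textbook forms of these metrics; the toy graph and its node names are hypothetical placeholders rather than AlzKB data, and the natural logarithm in Adamic-Adar is an assumption.

```python
import math
from collections import defaultdict


def build_adjacency(edge_list):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def topological_features(adj, n1, n2):
    """Compute CN, TN, PA, AA, and RA for the node pair (n1, n2)."""
    nbrs1 = adj.get(n1, set())
    nbrs2 = adj.get(n2, set())
    common = nbrs1 & nbrs2
    return {
        "CN": len(common),
        "TN": len(nbrs1 | nbrs2),
        "PA": len(nbrs1) * len(nbrs2),
        # A common neighbor of two distinct nodes has degree >= 2,
        # so log(degree) > 0; the guard only protects degenerate input.
        "AA": sum(1.0 / math.log(len(adj[u])) for u in common if len(adj[u]) > 1),
        "RA": sum(1.0 / len(adj[u]) for u in common),
    }


# Hypothetical toy graph (placeholder names, not real associations).
toy_edges = [
    ("GENE_A", "AD"),
    ("GENE_A", "PATHWAY_1"),
    ("GENE_B", "PATHWAY_1"),
    ("GENE_B", "DRUG_X"),
    ("AD", "DRUG_X"),
]
adj = build_adjacency(toy_edges)

# Features for a candidate gene paired with the disease node.
features = topological_features(adj, "GENE_B", "AD")
print(features)
```

Each gene's feature vector for the pair (gene, AD) could then serve as one input row to a classifier such as the random forest described in this section.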

Drug repurposing via graph data science
As a second use case, we considered the task of repurposing existing drugs (currently used to treat other diseases) based on patterns in the knowledge graph that suggest they may also treat AD. To do this, we trained 5 state-of-the-art knowledge graph completion methods (TransE, RotatE, DistMult, ComplEx, and ConvE) [66] on AlzKB and selected the highest performing of them to predict links between drugs and AD. Additional detail about the differences between these methods is provided in Supplemental Information. These models learn low-dimensional representations of graph nodes as vector embeddings. The embeddings are then combined to propose all possible triples in the graph (source node, edge, target node), and scores are generated to indicate the plausibility of each triple. We implemented the 5 models using PyKEEN, a Python library for knowledge graph embeddings. [67]

We randomly split the dataset of all triples into 80/10/10 training/validation/testing sets and used grid search to empirically set the embedding dimension to 256 and the number of epochs to 100, with early stopping allowed. All remaining hyperparameters were set to the PyKEEN defaults. We trained the models on Google Colab using a single Tesla T4 GPU and evaluated the results using the rank-based evaluation metrics Hits@k (k=1, 3, and 10) and Mean Reciprocal Rank (MRR). [68] Ranking-based evaluation sorts the scores of triples in descending order and sets each triple's rank to its index in the sorted list. In the case of multiple 'true' triples having an equal score, we used the average of the most optimistic (best) and pessimistic (worst) ranks across the metrics. Briefly, Hits@k is the fraction of true triples in the test set that are ranked within the top k predictions of the model. Higher values indicate better performance. MRR, also known as inverse harmonic mean rank, is the arithmetic mean of the inverse ranks of the true triples. We performed evaluation on both left- and right-side predictions, i.e., how well the models can predict missing entities in partial triples lacking either the head (source) or tail (target) entity.
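The rank-based metrics described above can be illustrated with a short, self-contained sketch. It is not PyKEEN's implementation; the function names and the toy scores are assumptions chosen for illustration. Ties are handled by averaging the optimistic and pessimistic ranks, as in the evaluation described in this section.

```python
def averaged_rank(scores, true_index):
    """Rank of the true candidate among all candidates (1 = best).

    Ties are resolved by averaging the optimistic rank (true candidate
    placed first among equals) and the pessimistic rank (placed last).
    """
    true_score = scores[true_index]
    higher = sum(1 for s in scores if s > true_score)
    tied = sum(1 for s in scores if s == true_score)  # includes the true candidate
    optimistic = higher + 1
    pessimistic = higher + tied
    return (optimistic + pessimistic) / 2


def hits_at_k(ranks, k):
    """Fraction of true triples ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Arithmetic mean of the inverse ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: the true candidate (index 1) ties with one other candidate,
# so its optimistic rank is 2 and its pessimistic rank is 3.
tied_rank = averaged_rank([0.9, 0.7, 0.7, 0.1], true_index=1)

# Ranks of three hypothetical test triples.
ranks = [1, tied_rank, 4]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 3), mean_reciprocal_rank(ranks))
```

For both-sided evaluation, ranks from head prediction and tail prediction would be pooled into a single list before computing Hits@k and MRR.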

Table 5 lists the top 10 predicted drugs along with their current approved usage and relevant clinical trial status pertaining to AD efficacy. Among the top 10 predictions, 3 have been investigated in clinical trials to treat symptoms of AD. To further explore these predictions, we generated visualizations of a minimum spanning tree linking the 10 drugs to AD in AlzKB's knowledge graph, as shown in Figure 3. The visualization shows that the shortest paths between the drugs and AD are mediated by a small set of AD-associated genes, each of which is associated with one or more of the proposed drugs. The visualization is suggestive of interpretable biological mechanisms by which the drugs could act on AD etiology, and provides hypotheses to further explore their validity.
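The idea that genes lying on the shortest path between a drug and AD act as candidate mechanistic explanations can be sketched with a breadth-first search. This is a simplification that finds one shortest path per drug rather than the minimum spanning tree used for Figure 3, and the graph below is a hypothetical toy example with placeholder node names.

```python
from collections import defaultdict, deque


def build_adjacency(edge_list):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def shortest_path(adj, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Sorting makes the traversal order (and thus the result) deterministic.
        for neighbor in sorted(adj.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def mediating_genes(path):
    """Interior nodes of the path that are genes (placeholder naming scheme)."""
    return [n for n in path[1:-1] if n.startswith("GENE_")]


# Hypothetical toy graph; none of these edges are real associations.
toy_edges = [
    ("DRUG_X", "GENE_1"),
    ("DRUG_Y", "GENE_1"),
    ("DRUG_Y", "GENE_2"),
    ("GENE_1", "AD"),
    ("GENE_2", "PATHWAY_1"),
    ("PATHWAY_1", "AD"),
]
adj = build_adjacency(toy_edges)

path_x = shortest_path(adj, "DRUG_X", "AD")
path_y = shortest_path(adj, "DRUG_Y", "AD")
print(path_x, mediating_genes(path_x))
print(path_y, mediating_genes(path_y))
```

In this toy graph both drugs reach AD through the same gene, mirroring the observation that a small set of genes mediates the paths for several predicted drugs.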

The role of AlzKB in biomedical knowledge discovery
AD and other neurodegenerative diseases present one of the greatest challenges in modern biomedicine. AD is by and large a disease of old age, and as improvements to healthcare continue to increase overall global life expectancy, we can expect the number of people with various forms of dementia to also increase. Since the etiology and pathophysiology of AD are highly multifactorial, there is likely no single 'cure' for the disease. Instead, researchers and public health officials have shifted much of their focus towards finding therapies that reduce risk, slow the progression of disease, and/or reverse neuronal damage. Additionally, since there are various subtypes of AD with distinct underlying mechanisms, any therapy might be effective for only some AD patients. Therefore, an essential step in reducing global disease burden is to propose many new therapeutic agents that target various aspects of AD pathology.

This is precisely the motivating use case for AlzKB. As we demonstrate, AlzKB provides a rich representation of existing knowledge about AD and the biological context in which it acts. AlzKB stands to become a major resource in the AD research community, where pattern analysis and integration with observational data can be used to propose a diverse array of new therapeutic hypotheses along with interpretable mechanistic explanations of how those therapies may act in the human body.

Discovering putative therapies through graph data science
Of the PD genes predicted to also be AD genes (Section 3.2; Table 3), some are involved in neuronal signaling and structure, and some are known to be involved in a wide range of neurological disorders. FYN has seen recent attention and investigation into its possible link to AD due to its broad expression in brain tissue and known interactions with tau proteins. [69,70] Among the other identified genes, one (CHRNB1) is known to be involved in acetylcholine signaling; [71,72] another (KCNIP3) codes for a protein that interacts with presenilin, and mutations in presenilin are causal for hereditary AD. [73,74] Some of these gene hits (ATXN2, DCTN1) have limited or no current research directly linking them to AD but are biologically plausible. As such, they may represent novel therapeutic targets or targets for further research and investigation. [75] For example, DCTN1 encodes the dynactin-1 protein, and deficits in dynactin are connected to several neurodegenerative diseases; however, there is limited research linking it to AD. [76,77]

Among the drug repurposing predictions (Section 3.3; Table 5) are some agents that have previously been proposed for the treatment of AD (risperidone and sertraline) or for symptoms associated with AD (nicotine). Sumatriptan has been the subject of several studies focused on AD [78] and is connected to a strong comorbidity of migraine headaches and dementia in women. [79] Pimozide has been shown to reduce the aggregation of tau protein in mice [80] and is linked to AD in a number of unrelated in silico models. [81] The inclusion of nicotine is also noteworthy, as it has seen recent interest among AD researchers and is the subject of an ongoing clinical trial to improve memory. [82] Other drugs listed in Table 5 have not yet been identified as AD treatments and represent novel repurposing candidates. Each can be considered a testable hypothesis meriting further investigation, lending credence to the greater discovery power of AlzKB's knowledge graph approach relative to existing AD data resources.

Future directions with AlzKB
AlzKB is a growing resource, and we have plans for adding new features and data types that are in various stages of implementation. Since a central hypothesis of AD pathogenesis revolves around the abnormal accumulation of proteins within and around brain cells, an important step will be to adequately distinguish genes from the proteins that those genes code for. Existing data resources available for inclusion in AlzKB largely fail to make this distinction in a way that is accepted by the scientific community, so we are currently evaluating options to use either postprocessing of existing knowledge sources or synthesis of new knowledge to achieve a good representation of the genes, proteins, and functional/structural variants that are key to understanding AD.

Machine learning models often do not generalize well to heterogeneous graphs, such as AlzKB's knowledge graph. This is largely because traditional models cannot utilize the network structure and heterogeneous nature of different entity types. Several promising algorithms can be used for prediction on heterogeneous graphs, including GraphSAGE [83] and metapath2vec [84], but most fail to scale effectively when the number of node types or edge types grows. Since any effective therapy must be accompanied by a mechanistic understanding of how it functions, we also need to ensure that new heterogeneous graph machine learning models are explainable. With these requirements in mind, we are using AlzKB as a motivating resource for designing new, cutting-edge algorithms that produce interpretable predictions over highly heterogeneous knowledge graphs. As we do so, these algorithms will be released alongside AlzKB with educational resources that facilitate ease of use and adoption by various stakeholders.

Ultimately, we aim to provide AlzKB as a robust resource that helps to unravel the etiology of AD. It is already a large, high-quality knowledge base from which graph-based AI/ML approaches can be developed for drug repurposing and drug discovery. As we and the rest of the biomedical research community make these discoveries in the coming years, they will be included and publicized on the AlzKB website as a public resource to drive innovation and scientific progress.

Obtaining AlzKB for local use and extending the knowledge graph
As a public and open-source resource for scientific discovery, we provide AlzKB through a variety of interfaces with distinct advantages for different use cases and user types. Casual users who wish to browse the knowledge base or perform simple analyses can do so directly through the Neo4j browser interface. [17] However, for more advanced use cases (or when computational needs exceed those available on the public version of the knowledge base), AlzKB can either be downloaded and populated locally into a Neo4j installation, or be built from the original source data files via the tools included in the AlzKB GitHub repository. [85] The latter option also allows users to extend the knowledge base to include additional data sources, entity types, or relationships beyond those provided in the official knowledge base distribution. We also encourage users who make modifications to the knowledge base to submit their changes for review and inclusion in the main code distribution. Instructions for how to contribute to AlzKB are also available on the GitHub repository.
Since the data sources included in AlzKB are themselves all drawn from open-source databases, we urge users to ensure that any new data sources they merge into AlzKB similarly comply with open-source standards. In brief, AlzKB can only be maintained under the most restrictive license terms of its included third-party sources, so restrictive license terms in a database being considered decrease that database's suitability for inclusion. We hope for AlzKB to be recognized as a community effort for aggregating and democratizing the discovery of new AD therapeutics, and therefore encourage public discussion of new methods and data sources to be included.

Software and Code Availability
AlzKB is publicly available online. [11] All pertinent source code and documentation are available on GitHub, [85] or in an archived version on Zenodo. [60] Users have the option of hosting local and/or modified copies of AlzKB; see Section 4.4 for further information.

Conclusions
In this work, we introduced the Alzheimer's Knowledge Base as a free, publicly available toolkit and data resource for novel discoveries in AD research, with a particular focus on therapeutic approaches to treating AD. AlzKB is both new and continually growing, and we aim to cultivate a community of researchers to collaboratively increase the impact, speed, and throughput of AD research, along with rapid dissemination to healthcare, academia, and the pharmaceutical industry. In the future, we will develop new AI and data science methods to continually extract knowledge from AlzKB, but in this study we already demonstrate through graph data science that AlzKB can both replicate existing AD knowledge and generate entirely new, testable hypotheses to drive the future of drug repurposing and drug discovery.

Figure 2. Random forest classifier performance (over 10 independent training runs) on the task of predicting whether PD genes are also AD genes based on patterns of graph connectivity in AlzKB's heterogeneous knowledge graph. Across all metrics, a score of 1.00 represents the maximum possible performance.

Figure 3. Spanning tree linking the 10 highest scoring AD drug predictions (listed in Table 5) to AD. Blue nodes are drugs, pink nodes are genes, and the orange node is AD. Genes on the shortest path between a drug and AD can be considered putative mechanistic explanations for how the drug may act on AD etiology.

The Alzheimer's Knowledge Base -A knowledge graph for therapeutic discovery in Alzheimer's Disease research
Joseph D. Romano, Van Truong, Rachit Kumar, Mythreye Venkatesan, Britney E. Graham, Yun Hao, Nick Matsumoto, Xi Li, Zhiping Wang, Marylyn Ritchie, Li Shen, Jason H. Moore
Submitted to: Journal of Medical Internet Research on: February 24, 2023

Table 1: Third-party public data sources used in AlzKB, what data elements are used from them, and website citations linking to their homepages. An asterisk (*) indicates that the data source is included via Hetionet.

Table 2: Node types and counts in AlzKB, listed in descending order by prevalence. Additional node types will be added over time, and counts will increase as new data sources are incorporated or existing sources are updated to newer versions.

Table 3: Parkinson's Disease genes predicted by the graph-augmented Random Forest model to also be associated with Alzheimer's Disease.

Table 4: Ranking-based evaluation metrics of 5 embedding-based link prediction models on AlzKB's knowledge graph. Metrics are derived from the likelihood of existing ('known') links to be predicted by the model. Higher scores indicate better performance.

Performance of the five knowledge graph completion models is summarized in Table 4. Among them, RotatE performed the best, with the highest MRR and Hits@k values. We therefore used RotatE to make predictions on the test set to obtain missing head entities with the template ([drug], DRUG_TREATS_DISEASE, Alzheimer's Disease). The top 10 predicted drugs are listed in Table 5.
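To illustrate how an embedding model can fill the head slot of the template ([drug], DRUG_TREATS_DISEASE, Alzheimer's Disease), the sketch below scores candidate heads in the style of RotatE, which models a relation as an element-wise rotation in the complex plane. It is a hand-written toy, not PyKEEN's implementation: the two-dimensional embeddings, the phase values, and the drug names are invented for illustration, and the use of the Euclidean norm is an assumption (implementations differ in the norm they use).

```python
import cmath
import math


def rotate_score(head, rel_phases, tail):
    """RotatE-style plausibility score: negative distance between the
    rotated head embedding and the tail embedding (higher = more plausible).

    head, tail: lists of complex numbers; rel_phases: rotation angles in radians.
    """
    dist_sq = 0.0
    for h, phase, t in zip(head, rel_phases, tail):
        rotated = h * cmath.exp(1j * phase)  # unit-modulus rotation of the head
        dist_sq += abs(rotated - t) ** 2
    return -math.sqrt(dist_sq)


def rank_heads(candidates, rel_phases, tail):
    """Sort candidate head entities from most to least plausible."""
    scored = [
        (name, rotate_score(embedding, rel_phases, tail))
        for name, embedding in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy embeddings (real models would learn e.g. 256 dimensions).
ad_embedding = [1 + 0j, 0 + 1j]
treats_phases = [math.pi / 2, math.pi]  # hypothetical DRUG_TREATS_DISEASE rotation
candidate_drugs = {
    "DRUG_X": [-1j, -1j],        # rotates exactly onto the AD embedding
    "DRUG_Y": [1 + 0j, 1 + 0j],  # lands far from the AD embedding
}

ranked = rank_heads(candidate_drugs, treats_phases, ad_embedding)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Taking the first ten entries of such a ranking, after excluding drugs already linked to AD in the training data, would yield a candidate list analogous to the one described in this section.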

Table 5: Drug repurposing predictions made by the best-performing RotatE topological link prediction model. Also shown are current approved indications and (if available) clinical trials investigating efficacy of the drug for treating AD.