Treatment Repurposing using Literature-related Discovery

This article describes the Literature-Related Discovery technique and its application to Treatment Repurposing (which includes, but goes well beyond, Drug Repurposing). Illustrative results of potential repurposed treatments are shown from a study on preventing and reversing Alzheimer’s disease. The detailed query used to generate these results is presented. The approach has the potential to identify voluminous amounts of candidate treatments for repurposing. Additionally, a broad review of the Drug Repurposing literature is provided. A Drug Repurposing database is retrieved and the structure and content are analyzed using Text Clustering and Factor Analysis. Two taxonomies of the Drug Repurposing literature are presented and specific major themes are shown.


Structure of Treatment Repurposing Literature
TR is the application of an existing treatment for one or more diseases to diseases or symptoms of interest other than the disease(s) or symptom(s) for which the treatment was developed (and used) initially. Many comprehensive reviews of one component of TR, drug repurposing/repositioning, have been published recently. [3][4][5][6][7][8][9][10][11][12] As shown in these reviews, as well as many other more narrowly-focused documents, there are myriad possible categorizations for the TR literature.
Two more objective perspectives on the structure of the TR literature are shown in Appendices 2 and 3. Appendix 2 contains a hierarchical taxonomy of the 2890-record Medline TR literature obtained with the CLUTO text clustering software (CLUTO. 2018. http://glaros.dtc.umn.edu/gkhome/views/cluto, University of Minnesota). This unique taxonomy presents the higher-level and most detailed categories that constitute the core TR biomedical literature. A display of the taxonomy linked to the titles of papers in the most detailed categories can be found in Appendix 2 of (Kostoff RN. Treatment Repurposing using Literature-Related Discovery. Georgia Institute of Technology. 2018. PDF. https://smartech.gatech.edu/handle/1853/60507). Appendix 3 of the present article contains the results of a factor analysis of the TR literature. Only the factor themes are presented and discussed. Appendix 3 of (Kostoff RN. Treatment Repurposing using Literature-Related Discovery. Georgia Institute of Technology. 2018. PDF. https://smartech.gatech.edu/handle/1853/60507) contains a factor matrix showing the themes and key phrases that had the strongest influence on determining the themes. The phrases under each factor are linked to record titles associated with those phrases.
Appendix 4 of (Kostoff RN. Treatment Repurposing using Literature-Related Discovery. Georgia Institute of Technology. 2018. PDF. https://smartech.gatech.edu/handle/1853/60507) contains extensive examples of myriad markers from the AD study and the directions in which they changed in association with the presence/imposition of AD contributing factors or the provision of AD treatments. These markers and their directions of change from treatments or contributing factors form the basis of the TR discovery approach.
The TR discovery approach presented in this article consists of a two-stage process:
Stage 1: identify critical markers associated with a disease of interest and identify how the values of those markers change 1) when contributing factors to disease are operable and 2) when treatments are operable.
Stage 2: search the non-disease-of-interest literature for potential treatments that will change the markers of interest in the desired direction.
Specific Methodology Adapted from AD Study
The first step in Stage 1 of the AD study (and in the recently-completed study on peripheral neuropathy/peripheral arterial disease, handle 1853/61865) was to 1) identify existing contributing factors (causes) to AD and 2) identify markers (mainly biomarkers) whose changes from the norm were associated with the AD contributing factors. Multiple approaches were used to identify these existing AD contributing factors and their associated markers, since no one approach was fully comprehensive.

Visual Inspection
A Visual Inspection approach was used initially for the AD study. It started by generating a database of millions of abstract phrases parsed from ~100,000 records that constituted the total AD core Medline literature. Then, tens of thousands of the highest frequency phrases were inspected visually and those that appeared to be contributing factors to AD were selected. During this process and in the subsequent confirmatory process that validated the selection of AD contributing factors, many non-biomedical terms were identified that were closely associated with the existing AD contributing factors (shown in the next section). These non-biomedical terms could then be (and were) used as 'linking terms', to target lower frequency phrases (among the millions of abstract phrases) that had high probability of being/including existing AD contributing factors.
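The phrase database that feeds visual inspection can be approximated with a simple phrase counter. The sketch below is not the study's parser: it assumes a phrase is any contiguous sequence of one to three words, and the function names and the two toy abstracts are invented for illustration.

```python
from collections import Counter
import re

def extract_phrases(abstract, max_words=3):
    """Return every contiguous word sequence of 1..max_words words.

    Assumption: phrases may span sentence boundaries, since punctuation
    is discarded before the word sequences are formed.
    """
    words = re.findall(r"[a-z0-9][a-z0-9\-]*", abstract.lower())
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases

def phrase_frequencies(abstracts, max_words=3):
    """Count phrase occurrences across a collection of abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(extract_phrases(abstract, max_words))
    return counts

# Toy abstracts, invented for this example only.
abstracts = [
    "Aluminum exposure increased oxidative stress in neural cells.",
    "Chronic aluminum exposure was associated with cognitive decline.",
]
frequencies = phrase_frequencies(abstracts)
# The highest-frequency phrases are the ones a reader would inspect first.
top_phrases = frequencies.most_common(10)
```

Sorting by frequency only orders the reading list; deciding which phrases are contributing factors remains a manual judgment, as described above.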

Identify critical markers and their directions of change associated with existing AD treatments
The second step in Stage 1 of the AD study (and in the recently-completed study (Kostoff RN. Prevention and Reversal of Peripheral Neuropathy/Peripheral Arterial Disease. Georgia Institute of Technology. 2019. PDF. http://hdl.handle.net/1853/61865)) was to 1) identify existing AD treatments and 2) identify markers (mainly biomarkers) whose changes from the norm were associated with the existing AD treatments. Multiple approaches were used to identify these existing AD treatments and their associated markers, since no one approach was fully comprehensive.

Visual Inspection
A Visual Inspection approach (part of the visual inspection approach described in the previous section) was used, which consisted of reading the thousands of high frequency abstract phrases in the core AD literature and selecting those that appeared to be treatments for AD. During this process and in the subsequent confirmatory process that validated the selection of existing AD treatments, non-biomedical terms were identified that were closely associated with the existing AD treatments (shown in the next section). These non-biomedical terms could then be (and were) used as 'linking terms', to target phrases (among the millions of abstract phrases) that had high probability of being/including existing AD treatments.
These linking terms were especially valuable for accessing existing low-frequency AD treatments not accessible from visual inspection of the high-frequency phrases. Some of these linking terms had higher efficiencies of identifying the treatment consequences of interest than others.

Linking Term
A number of linking term approaches were used to target records or phrases with high probability of containing existing AD contributing factors. These included:
-MeSH Qualifiers associated strongly with contributing factors (e.g., adverse effects, toxicity, pathogenicity, poisoning);
-Relatively unambiguous MeSH Headings associated strongly with contributing factors (e.g., "Drug-Related Side Effects AND Adverse Reactions"; Abnormalities, Drug Induced; Air Pollutants, Occupational; Amphetamine Related Disorders; Carcinogens; Chemical Warfare Agents; Chemically-Induced Disorders, etc.);
-Text terms associated strongly with contributing factors (e.g., -induced; caused by; induced by; -contaminated; exposure to; exposure(s) [at end of phrase]; exposed to; poisoning [at end]; -exposed [at end]; -related; -associated; -infected; abuse*; toxicity).
These linking terms were especially valuable for accessing low-frequency existing AD contributing factors not accessible from visual inspection of the high-frequency phrases.
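The text-term part of the linking approach lends itself to pattern matching. The following is a minimal sketch, assuming the abstract phrases are available as plain strings; the regular expressions are one possible encoding of a subset of the text terms listed above, and the phrase list is a toy example rather than study data.

```python
import re

# Hypothetical encodings of some of the text linking terms listed above.
LINKING_PATTERNS = [
    re.compile(r"-induced\b"),
    re.compile(r"\b(caused|induced) by\b"),
    re.compile(r"\bexposure to\b"),
    re.compile(r"\bexposed to\b"),
    re.compile(r"\bexposures?$"),   # exposure(s) at end of phrase
    re.compile(r"\bpoisoning$"),    # poisoning at end of phrase
    re.compile(r"-exposed$"),       # -exposed at end of phrase
    re.compile(r"\babuse\w*"),      # abuse*
    re.compile(r"\btoxicity\b"),
]

def flag_candidates(phrases):
    """Return the phrases that contain at least one linking term."""
    flagged = []
    for phrase in phrases:
        lowered = phrase.lower()
        if any(pattern.search(lowered) for pattern in LINKING_PATTERNS):
            flagged.append(phrase)
    return flagged

# Toy phrases, invented for this example only.
phrases = [
    "lead exposure",
    "aluminum-induced neurotoxicity",
    "memory test scores",
    "exposure to pesticides",
    "cholinesterase inhibitor",
]
candidates = flag_candidates(phrases)
```

A flagged phrase is only a candidate; in the study, each selection was confirmed by reading the relevant abstracts.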

Dot Product
A dot product approach was used to identify phrases that had high probability of being existing AD contributing factors. External lists of toxic substances generated by Federal government organizations, state regulatory agencies and other major organizations were aggregated. The final list of toxic substances was intersected with the full list of millions of abstract phrases in the core AD literature, to identify additional existing AD contributing factors.
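The intersection step can be sketched as a set operation. The code below is an illustration under stated assumptions: terms are matched after lowercasing and whitespace normalization only, and both lists are toy data. The `dot_product_overlap` function reflects one reading of the name 'dot product' (the dot product of two binary membership vectors counts the shared terms); that reading is an assumption, not something the study states.

```python
def normalize(term):
    """Lowercase a term and collapse internal whitespace."""
    return " ".join(term.lower().split())

def intersect_lists(external_terms, abstract_phrases):
    """Return the normalized terms present in both lists, sorted."""
    external = {normalize(t) for t in external_terms}
    phrases = {normalize(p) for p in abstract_phrases}
    return sorted(external & phrases)

def dot_product_overlap(external_terms, abstract_phrases):
    """Count shared terms as the dot product of binary membership vectors."""
    external = {normalize(t) for t in external_terms}
    phrases = {normalize(p) for p in abstract_phrases}
    vocabulary = sorted(external | phrases)
    a = [1 if v in external else 0 for v in vocabulary]
    b = [1 if v in phrases else 0 for v in vocabulary]
    return sum(x * y for x, y in zip(a, b))

# Toy lists, invented for this example only.
toxic_substances = ["Lead", "Mercury", "Aluminum", "Paraquat"]
abstract_phrases = ["aluminum", "oxidative stress", "lead", "tau  protein"]

shared = intersect_lists(toxic_substances, abstract_phrases)
overlap = dot_product_overlap(toxic_substances, abstract_phrases)
```

Exact string matching produces false positives (here, 'lead' could also be a verb), which is one reason the matched terms still need confirmation against the abstracts.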
The total number of validated existing AD contributing factors identified by the above approaches (from the premier biomedical literature) numbered about 400-600, depending on how the existing AD contributing factors were aggregated. In all the approaches to identifying existing AD contributing factors shown above, the initial existing AD contributing factors selected were confirmed and validated by detailed reading of the relevant abstracts.
During the confirmation and validation process, one or (usually) more record abstracts containing the candidate existing AD contributing factor term were read and other relevant data in the abstract were recorded. These data included biomarkers, symptoms and behaviors impacted by the existing AD contributing factor(s) and the directions in which these markers were moved (increased, decreased, etc.). In some/many of these records, one or more existing AD treatment(s) were also identified, as well as the myriad markers associated with the existing AD treatments and the directions of change these markers experienced as a result of the existing AD treatment(s). These treatment-related data were also recorded.
Likewise, during the confirmation and validation of existing AD treatments, one or (usually) more record abstracts containing the candidate existing AD treatment term were read and other relevant data in the abstract were recorded. These data included biomarkers, symptoms and behaviors impacted by the treatment(s) and the directions in which these markers were moved (increased, decreased, etc.) in association with the treatment. In some/many of these records, one or more existing AD contributing factor(s) were also identified, as well as the myriad markers associated with the existing AD contributing factor(s) and the directions of change these markers experienced in association with the existing AD contributing factor(s). These contributing factor-related data were also recorded.
Additional markers could have been identified using the same approaches for identifying contributing factors and treatments, but that was not done in the AD study. It was done in the more recent study on peripheral neuropathy (PN)/peripheral arterial disease (PAD) (Kostoff RN. Prevention and Reversal of Peripheral Neuropathy/Peripheral Arterial Disease. Georgia Institute of Technology. 2019. PDF. http://hdl.handle.net/1853/61865) and approximately four times as many markers were identified compared to the AD study.

Stage 2
Search the non-disease-of-interest literature for potential treatments that will change the markers of interest in the desired direction.
Text mining of the AD biomedical literature (especially records focused on treatments and contributing factors) identified the critical markers associated with AD and identified the directions in which these critical markers needed to change for potential AD alleviation. For example, critical general biomarkers for AD and their desired directions of change included 'reduce oxidative stress', 'alleviate mitochondrial dysfunction', 'prevent apoptosis', etc. Critical specific biomarkers for AD and their desired directions of change included 'reduce BACE1', 'increase Bcl-2', 'enhance ADAM10', etc.
From these markers and their desired directions of change for effective treatment of AD, a query was developed to 1) identify potential AD treatments from 2) treatments used in the non-AD literature (see Appendix 1 for query details). The non-AD biomedical literature was then searched for records including one or more of these AD markers that moved in desired directions as a result of treatments (e.g., reduced Abeta; increased Bcl-2; reduced tau hyperphosphorylation; restricted NF-kappaB signaling; reduced inflammation; reduced oxidative stress; enhanced Nrf2, etc.).
Searching for records that had a threshold of including at least one of these desired marker alterations produced a voluminous retrieval. To keep the records retrieved at a manageable level, the threshold was raised to two desired marker alterations per record.
The linking terms used to identify treatment consequences differed in how reliably they signaled improvement. Terms like prevent*, protect*, improv*, restor*, alleviat*, ameliorat*, mitigat*, etc., almost always gave the desired AD markers and the direction in which they changed as a result of treatment. Terms like decreas* and increas* (used initially, then abandoned), reduc*, slow*, etc., could go either way. The former group of terms had the 'sense' of improvement, while the latter group of terms reflected change (positive or negative) and may or may not have reflected improvement.
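The difference between the two groups of terms can be made concrete with wildcard matching. This is a sketch only: it assumes the trailing `*` behaves as a simple word-prefix wildcard, and the function names and example phrases are invented for illustration.

```python
import fnmatch

# Terms quoted in the text; '*' is treated as a word-prefix wildcard.
IMPROVEMENT_TERMS = ["prevent*", "protect*", "improv*", "restor*",
                     "alleviat*", "ameliorat*", "mitigat*"]
AMBIGUOUS_TERMS = ["reduc*", "slow*", "decreas*", "increas*"]

def matches_any(word, patterns):
    """True if the lowercased word matches at least one wildcard pattern."""
    lowered = word.lower()
    return any(fnmatch.fnmatchcase(lowered, p) for p in patterns)

def classify_phrase(phrase):
    """Label a phrase by the strongest direction term it contains."""
    words = phrase.split()
    if any(matches_any(w, IMPROVEMENT_TERMS) for w in words):
        return "improvement"
    if any(matches_any(w, AMBIGUOUS_TERMS) for w in words):
        return "ambiguous"
    return "none"

# Invented example phrases.
labels = {
    "alleviated mitochondrial dysfunction": classify_phrase(
        "alleviated mitochondrial dysfunction"),
    "reduced Bcl-2 expression": classify_phrase("reduced Bcl-2 expression"),
    "tau protein levels": classify_phrase("tau protein levels"),
}
```

The second phrase shows why the ambiguous group needs reading: 'reduced' signals change, but for Bcl-2 the desired direction is an increase.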
The total number of existing AD treatments identified by the above approaches (from the premier biomedical literature and validated) numbered about 600-700, depending on how the existing AD treatments were aggregated. In all the approaches to identifying existing AD treatments shown above, the initial existing AD treatments selected were confirmed and validated by detailed reading of the relevant abstracts.
The number of AD treatments we identified might seem unrealistically large at first glance, but these levels have occurred in all our disease reversal studies. These studies are based on the following systemic medical principle: at the present time, removal of cause is a necessary, but not necessarily sufficient, condition for restorative treatment to be effective. In the AD study, the treatments identified covered research over the past ~thirty years. The study did not exclude treatments that have 'failed' in human clinical trials, for the following reasons. Reading of thousands of abstracts on laboratory experiments and clinical trials of potential AD treatments has shown that 1) in vitro experiments typically performed on neural cells tend to have reasonably positive outcomes, at least for those papers that surface in the peer-reviewed published literature; 2) in vivo experiments typically performed on rodents (but other small animals as well) tend to also have reasonably positive outcomes, albeit somewhat less than in vitro experiments; 3) when these potential treatments reach the human clinical trial stage, especially the later phases, the success rates plummet. The explanation for this discrepancy given most often is the species difference. Humans are different from rodents and other laboratory animals, and their physiological responses to stimuli are different as well. However, the toxic experiential and exposure background differences between humans who live in the sea of toxic exposures in the real world and animals who live in the very controlled environment of the laboratory are rarely, if ever, discussed.
There were many hundreds of potential causes for AD identified in the AD study (ranging from Lifestyle to Occupational/Environmental exposures). For a given individual, some causes have happened in the past and are no longer happening, but their damage trail remains. Other causes are ongoing, have caused damage and continue to cause damage.
Why would anyone expect a human being with such a toxic history to respond to a potential treatment the same way that a laboratory animal raised in a controlled environment would respond to that treatment? Furthermore, why would anyone expect a human being with such a toxic history to respond to a potential treatment the same way that another human being without such a toxic burden would respond to that treatment?
We cannot rule out failure to remove cause as a reason for the massive failure of myriad AD treatments in the clinical trials of the past three decades. That is why we retained even so-called 'failed' treatments in the present full-spectrum study of existing AD treatments. We don't know which treatments failed because 1) they were intrinsically ineffective or 2) their beneficial effects were overwhelmed by the strong negative effects of the ongoing causes remaining operable.

• Fortunellin protects against high fructose-induced diabetic heart injury in mice by suppressing inflammation and oxidative stress via AMPK/Nrf-2 pathway regulation [49]
• Protective effects of sarains on H2O2-induced mitochondrial dysfunction and oxidative stress; improving mitochondrial function and decreasing reactive oxygen species levels; ability to block the mPTP and to enhance the Nrf2 pathway [50]
• Carboxyamidotriazole alleviates muscle atrophy in tumor-bearing mice by inhibiting NF-kappaB and activating SIRT1; CAI restricted the NF-kappaB signaling, downregulated the level of TNF-alpha in muscle and both TNF-alpha and IL-6 levels in serum, directly stimulated SIRT1 activity in vitro and increased SIRT1 content in muscle [51]
• Protective effects and mechanism of meretrix meretrix oligopeptides (MMO) against nonalcoholic fatty liver disease; MMO inhibited the activation of cell death-related pathways, based on reduced p-JNK, Bax expression, tumor necrosis factor-alpha, caspase-9 and caspase-3 activity in the NAFLD model cells and Bcl-2 expression was enhanced in the NAFLD model cells [52]
• Extract from Periostracum cicadae inhibits oxidative stress and inflammation induced by Ultraviolet B Irradiation; decreased reactive oxygen species (ROS) production. The extract attenuates the expression of interleukin-6 (IL-6), matrix metalloproteinase-2 (MMP-2) and MMP-9 in UVB-treated HaCaT cells. Also, P. cicadae abrogated UVB-induced activation of NF-kappaB, p53 and activator protein-1 (AP-1); accumulation and expression of NF-E2-related factor (Nrf2) were increased [53]
Note that while the query was limited to combinations of two biomarkers only as selection criteria, the actual numbers of biomarkers in the retrieved records that moved in the desired directions for healing were typically greater (sometimes much greater) than two.

Validation of TR Candidates
The final step involved in converting an existing treatment in the non-AD literature to a repurposed AD treatment is validation that the potential AD treatment has not been associated with AD application in the literature. The following block places the validation of our TR findings in context.
Treatment discovery/repurposing validation (or contributing factor or characteristic validation) is defined as the process of demonstrating that the candidate treatment has not been used or proposed for application to the disease of interest. This will be a function not only of the scope of the disease literature assumed, but also of which databases are included in the definition of the disease literature.
The AD study used the Pubmed version of Medline to retrieve the core AD literature and used both the Pubmed and Thomson Reuters versions to determine previous use. None of the treatment discoveries/repurposing candidates listed above were present in these versions of the AD literature.
To keep the retrieval manageable, the requirement was imposed that a record in the non-AD literature must contain at least two AD markers (that moved in the appropriate direction in conjunction with a treatment) to be retrieved. Even then, the retrieval was voluminous, indicating the wealth of potential AD treatment repurposing possible from an expanded well-resourced study.
As a practical matter, combinations of the more fundamental and less AD-specific linking phrases were used for the treatment repurposing query. The general form of the query was 1) combinations of the markers and their desired directions of change followed by 2) negation of records that contained existing AD treatments.
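The general form of the query described here (marker combinations, followed by negation of the core literature and existing treatments) can be assembled mechanically. The sketch below assumes a Boolean syntax with AND/OR/NOT operators and parentheses; the function names are hypothetical, the marker phrases are taken from examples in this article, and the two treatment names merely stand in for the study's much longer list of existing AD treatments.

```python
from itertools import combinations

def or_group(terms):
    """Join terms into a parenthesized OR expression."""
    return "(" + " OR ".join(terms) + ")"

def build_tr_query(marker_a, marker_b, core_terms, existing_treatments):
    """Two markers with desired directions, minus core-literature records."""
    exclusion = or_group(list(core_terms) + list(existing_treatments))
    return f"({marker_a} AND {marker_b}) NOT {exclusion}"

def pairwise_queries(markers, core_terms, existing_treatments):
    """One query per unordered pair of markers."""
    return [
        build_tr_query(a, b, core_terms, existing_treatments)
        for a, b in combinations(markers, 2)
    ]

# Illustrative inputs; the treatment list is a placeholder, not the study's.
markers = ['"reduced inflammation"', '"reduced oxidative stress"',
           '"enhanced Nrf2"']
core_terms = ["alzheimer*", "dementia", '"mild cognitive impairment"']
existing_treatments = ["donepezil", "memantine"]

query = build_tr_query(markers[0], markers[1], core_terms, existing_treatments)
all_queries = pairwise_queries(markers, core_terms, existing_treatments)
```

With n markers this yields n(n-1)/2 pairwise queries, which is one reason the choice of marker combinations matters for keeping the retrieval manageable.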
While terms such as 'reduce Abeta' or 'reduce tau phosphorylation' may be efficient for extracting existing AD treatments from the AD literature, they are very inefficient, either in isolation or especially in combination, for AD treatment repurposing from the non-AD literature. It is difficult to imagine people doing research in reducing Abeta or reducing tau hyperphosphorylation (much less doing research in both) not emphasizing the AD/dementia applications in their publications.
Finally, there are no restrictions on the numbers of treatments that could be repurposed for any disease of interest. For example, assume that a patient has been diagnosed with a specific disease, characterized by three abnormal biomarker values. The query could be applied to identify/discover 1) one treatment that would bring all three of the biomarkers back to normal, or 2) one treatment that would bring two of the biomarkers back to normal and one treatment that would bring the third biomarker back to normal, or 3) three treatments, each of which would bring one of the biomarkers back to normal. Obviously, the repurposed treatments in 2) and 3) would have to be compatible, but the technique offers a wide variety of options.
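The three-biomarker example can be worked through as a small coverage search. This is a sketch under stated assumptions: each candidate treatment is represented only by the set of biomarkers it normalizes, the treatment and biomarker names (T1 to T5, m1 to m3) are hypothetical, and compatibility between treatments is not checked.

```python
from itertools import combinations

def covering_combinations(treatment_effects, abnormal_markers, max_size=3):
    """List minimal treatment combinations that normalize every marker.

    treatment_effects maps a treatment name to the set of markers it
    brings back to normal. Combinations containing an already-found
    smaller covering combination are skipped as redundant.
    """
    target = set(abnormal_markers)
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(treatment_effects), size):
            if any(set(found) <= set(combo) for found in results):
                continue
            covered = set().union(*(treatment_effects[t] for t in combo))
            if target <= covered:
                results.append(combo)
    return results

# Hypothetical treatments and markers for the three-biomarker example.
treatment_effects = {
    "T1": {"m1", "m2", "m3"},  # case 1: one treatment covers all three
    "T2": {"m1", "m2"},        # case 2: covers two, paired with T3
    "T3": {"m3"},
    "T4": {"m1"},              # case 3: T3, T4 and T5 cover one each
    "T5": {"m2"},
}
options = covering_combinations(treatment_effects, ["m1", "m2", "m3"])
```

In practice each multi-treatment option would still need a compatibility check before being considered further.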

Illustrative Examples of Potential Repurposed Treatments for AD
Appendix 1 contains the details of the actual query used to identify potential repurposed treatments for AD. Since the treatment repurposing described in the AD study was a proof-of-principle demonstration of the latest incarnation of our LRDI approach, only a few examples were provided for illustration. The approach has the capability to generate voluminous TR results for any disease or symptom of interest; the only limitations are study resources.
The same generic process can be applied to identifying contributing factors to a symptom or disease of interest that have not been existent previously in the core biomedical literature of that disease or symptom of interest but have been existent in the core biomedical literature of other diseases or symptoms. The same extrapolation process can be used for myriad markers as well.
For identification of both treatments and contributing factors, research needs to be done on which combinations of biomarkers will be most fruitful in retrieving the largest volume of high-quality TR and new contributing factor candidates.
What are the characteristics of such combinations that will maximize marginal utility?
There could be other ways to define the scope of AD. There are also many other databases that could be searched for validation purposes, including other literature indexes, patent databases, books, magazine articles not indexed in Pubmed or Thomson Reuters, etc. Thus, the present validation should be viewed as limited, even though it uses the method and the databases used by most (if not all) of the literature-based discovery community. For discovery patenting purposes, or other purposes, more extensive validation and larger numbers of databases may be required.
In summary, each candidate potential AD treatment retrieved using the discovery query required validation before becoming a potential AD treatment. The candidate potential AD treatment was intersected with the core AD literature in late 2017 and was validated only after this intersection showed orthogonality (i.e., no records were retrieved).
For example, the candidate potential AD repurposed treatment "fortunellin" was retrieved because it satisfied the desired query general biomarker combination of reducing inflammation and oxidative stress. Fortunellin also had the additional specific biomarkers-based benefits of reducing the pro-inflammatory cytokines and the expression of p-IkappaB kinase alpha, p-IkappaBalpha and p-nuclear factor-kappaB, while significantly enhancing superoxide dismutase, catalase, heme oxygenase-1 and p-AMP-activated protein kinase. Fortunellin was intersected with the core AD biomedical literature retrieval terms (alzheimer* OR dementia OR "mild cognitive impairment") and no records were retrieved, demonstrating that fortunellin could not be found in the core AD literature. Fortunellin was therefore validated as a potential AD repurposed treatment (LRD Discovery).
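The validation step amounts to checking that a candidate and the core-literature terms never co-occur. The sketch below runs that check against an in-memory list of record texts rather than a live Pubmed search; the regular expression is an assumed encoding of the retrieval terms quoted above, and the second record (about a hypothetical 'compound X') is invented for contrast.

```python
import re

# Assumed encoding of: alzheimer* OR dementia OR "mild cognitive impairment"
CORE_PATTERN = re.compile(
    r"\balzheimer\w*|\bdementia\b|\bmild cognitive impairment\b",
    re.IGNORECASE,
)

def core_hits(candidate, records):
    """Return records that mention both the candidate and a core AD term."""
    needle = candidate.lower()
    return [
        record for record in records
        if needle in record.lower() and CORE_PATTERN.search(record)
    ]

def is_validated(candidate, records):
    """A candidate is validated only if the intersection is empty."""
    return len(core_hits(candidate, records)) == 0

# Toy record texts; the second one is invented for this example.
records = [
    "Fortunellin protects against high fructose-induced diabetic heart "
    "injury in mice by suppressing inflammation and oxidative stress",
    "Compound X improves memory in a mouse model of dementia",
]
fortunellin_ok = is_validated("fortunellin", records)
compound_x_ok = is_validated("compound X", records)
```

The result is only as strong as the record collection searched, which is the limitation on validation scope discussed above.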

CONCLUSION
The TR literature reflects intense interest in the medical community for extracting maximum utility from drugs developed already. TR practitioners come from varied discipline communities and make use of myriad analytical predictive approaches, including text-mining, machine learning, network-based, semantics, ligand-binding/ligand-protein docking/binding-site focused, protein targeting and transcriptional signature-focused. The main diseases studied are cancer, neurodegenerative diseases and infectious diseases. TR may be equally important for the rare diseases, where the modest number of potential patients may not justify the expense to the manufacturers of separate new drug development. The main biomarker targets studied focus on oxidative stress and inflammatory metrics.
The LRD-TR approach has evolved from its initial structure in 2008 [1] to the more advanced and targeted process described in the present article. It has the capability to generate voluminous TR results for any disease or symptom of interest.
The most general form of the TR query can incorporate any number of biomarkers and other markers of interest. For AD, a two-biomarker query was deemed adequate for demonstration purposes. The generic form of the two-biomarker AD treatment repurposing query is
(A AND B) NOT (C OR D)
where
A is a biomarker and its associated desired direction of change
B is another biomarker and its associated direction of change
C is the query used to retrieve the AD core literature
D is a list of existing AD treatments identified in the initial part of the AD study
Thus, the combination (A and B) retrieves ALL records from the full biomedical literature that contain potential AD treatments based on the two desired characteristics A and B, while (C or D) subtracts those records and existing treatments associated with the AD core literature. The remainder is non-AD records with substances that could be candidate repurposed AD treatments, based on the requirement that A and B must be present.

OR Existing AD Treatments (see the full list of the existing AD treatments that were used in Appendix 6-1 of the AD study)
The full reference for each title in the text clustering taxonomy is provided in Chapter 4 of the referenced Treatment Repurposing report (handle 1853/60507), which will allow the reader to pursue the full text for further information.
Before the details of the 32 'leaf' (lowest level) clusters are presented, a high-level (top three levels) view of the TR text clustering taxonomy is shown in Table A2-1. The first bifurcated level of the hierarchical taxonomy shows two definite thrust areas: Methods for drug repurposing prediction (Cluster 58) and disease treatments that resulted from drug repositioning (Cluster 61). The next two levels of the hierarchy are self-explanatory.
Many thousands of candidate repurposed AD treatment records were retrieved. While five were selected and validated, many more were available.
The AD study generated ~200 biomarkers and ~50 symptoms. The 20 terms selected for the above query were all biomarkers. The PN/PAD study, building off the experience from the AD study, has identified ~750 biomarkers and ~250 symptoms. More are still possible.
The high-level view of the taxonomy (Table A2-1) and the 32 elemental clusters (the lowest and most detailed level of the taxonomy; Table A2-2) will be presented in the following sections.
The version of CLUTO used for this analysis does not include fuzzy clustering, so each record is assigned to one cluster only. Many records contained multiple themes and could have been assigned to more than one cluster. Nevertheless, the taxonomy does provide a unique and interesting perspective on the structure of the TR literature. Tables A2-1 and A2-2 are, of necessity, very broad. In Appendix 2 of (Kostoff RN. Treatment Repurposing using Literature-Related Discovery. Georgia Institute of Technology. 2018. PDF. https://smartech.gatech.edu/handle/1853/60507), the titles are provided for each of the 32 elemental clusters shown in Table A2-2, to provide the full spectrum of sub-themes within each elemental cluster and allow the interested reader to identify specific sub-themes of personal interest within the cluster.

Appendix 3: TR Literature Taxonomy based on Factor Analysis

Overview of TR literature taxonomy based on Factor Analysis
The previous appendix provided one perspective (text clustering) on the taxonomic structure of the TR literature. The CLUTO text clustering software incorporates all phrases (minus stop-words) and uses one selected algorithm to generate a hierarchical taxonomy. Another approach providing a complementary perspective on the TR literature structure is factor analysis. Here, only pre-selected phrases are used.
The present appendix contains the results of a 37-factor study, where three of the factors define strong themes (high factor loadings) at each end of the phrase list (Factors 5, 8, 11). The main theme of each factor is presented here. A full factor matrix that identifies the main theme of each factor (and shows the key phrases that determine each theme) is presented in Appendix 3 of (Kostoff RN. Treatment Repurposing using Literature-Related Discovery. Georgia Institute of Technology. 2018. PDF. https://smartech.gatech.edu/handle/1853/60507).
In this reference, the titles of the records associated with each of the key phrases in the 37 themes are presented as well. The factor analysis was conducted using the Vantage Point software (thevantagepoint.com).

Results of Factor Analysis
While many broad thematic categories were possible, four appeared to be dominant. These included: Repurposing Prediction Methodologies, Diseases, Biomarker Targets and Drug Types. Figure A3-1 lists all the 37 factor themes. The main diseases studied are cancer, neurodegenerative and infectious. The main biomarker targets studied focus on oxidative stress and inflammatory metrics. While drugs of many different classes have been researched for repurposing, the main drug classes as emphasized in Figure A3-1 are inhibitors of myriad signaling pathways. Finally, the main repurposing prediction methodologies studied focus on networks, similarity, machine