CACTI: an in silico chemical analysis tool through the integration of chemogenomic data and clustering analysis

It is well-accepted that knowledge of a small molecule’s target can accelerate optimization. Although chemogenomic databases are helpful resources for predicting or finding compound interaction partners, they tend to be limited and poorly annotated. Furthermore, unlike genes, compound identifiers are often not standardized, and many synonyms may exist, especially in the biological literature, making batch analysis of compounds difficult. Here, we constructed an open-source annotation and target hypothesis prediction tool that explores some of the largest chemical and biological databases, mining these for both common name, synonyms, and structurally similar molecules. We used this Chemical Analysis and Clustering for Target Identification (CACTI) tool to analyze the Pathogen Box collection, an open-source set of 400 drug-like compounds active against a variety of microbial pathogens. Our analysis resulted in 4,315 new synonyms, 35,963 pieces of new information and target prediction hints for 58 members. Scientific contributions With the employment of this tool, a comprehensive report with known evidence, close analogs and drug-target prediction can be obtained for large-scale chemical libraries that will facilitate their evaluation and future target validation and optimization efforts. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-024-00885-2.


Introduction
Understanding how a drug interacts with its biological target and whether its inhibition results in the desired phenotypic response is a critical step in drug discovery [1][2][3].One area where this is particularly true is for infectious diseases where many of the hits that are advanced for drug discovery come from phenotypic, organismal screens as described in a recent review [4].For malaria parasites, tens of thousands of compounds with parasite killing activity have been placed in the public domain.This has led to medicinal chemistry programs and molecules entering clinical trials [4,5].While understanding the mechanism of action or knowing the target is not strictly essential for a compound series to advance, knowing the target and ideally having a crystal structure for the target can make subsequent medicinal chemistry optimization much more efficient: fewer compounds need to be made and evaluated, and biochemical assays are typically more robust and lower cost.Furthermore, screening and assay costs can be many orders of magnitude lower for a biochemical target relative to whole organism work.
Because of the huge-cost savings that can come from knowing a target, substantial effort is often invested into target discovery.Biochemical and genetic approaches (e.g. in vitro evolution) are two examples of methods used for target identification.However, these tend to be lengthy and are resource-consuming [4,5].For malaria parasites, in vitro evolution has been a successful method but can often take six months or more [4].It thus makes sense to accomplish as much computationally as possible.For example, molecular docking is commonly pursued to predict small molecule ligand binding to key therapeutic proteins prior to biochemical or phenotypic assays, and though it tends to have a high false positive rate [6], drug targets for some antimalarial inhibitors have been identified through this approach [7][8][9].Similarly, virtual screening is a commonly applied.Singh et al. [10] describes how this approach can drastically reduce the chemical space to provide a focused library for subsequent biochemical high-throughput screening (HTS), although it is most reliable when knowledge of compound activity is available and libraries against a target of interest are accessible.Antimalarial compounds have also been elucidated through virtual screening [11][12][13].Comparable in silico approaches also exist for target identification.For example, the tool TargetHunter by Wang et al. [14], is a web-based prediction approach that incorporates analog bioactivity data from ChEMBL [15].Likewise, Chemmine, is an online resource by Backman et al. [16] that predicts targets based on similar records in PubChem database [17].More recently, Jimenes-Vargas et al. [18] showed the power of artificial intelligence in drug discovery.Here, authors predict compound-target interactions based on different molecular description calculations and machine learning algorithms using ChEMBL records.
A constraint for these resources is the limit of one molecule per search and one chemogenomic database.Modern high-throughput screens may involve millions of compounds and thousands of primary hits.Having highthroughput methods to rapidly assess many hits can allow prioritization of compounds for resupply and resynthesis, which is particularly important for academic researchers that may have limited access to compound management groups or limited medicinal chemistry capacity.For example, in malaria parasites, many dihydrofolate reductase (DHFR) inhibitors have been compromised by widespread resistance [19].It thus makes sense to identify phenotypic hits acting against DHFR prior to compound resupply.In fact, in silico approaches have proven to be a valuable in the analysis of large libraries, especially when various resources are utilized together to reduce the number of compound resupply and experimental testing needed [20,21].
To address the need of an automated multi-compound analysis tool, particularly for neglected diseases, we constructed a pipeline named CACTI (Chemical Analysis and Target Identification), to provide comprehensive searches in chemogenomic databases and provide biological targets clues with single or bulk queries, integrating data not only from major databases such as ChEMBL and PubChem, but also from ligand-based databases (e.g.BindingDB [22]), as well as scientific and patent evidence (e.g.PubMed [23] and SureChEMBL [24]).Furthermore, to account for the difference in molecule identifiers across databases, we implemented a cross-reference method to map a given identifier based on chemical similarity scores and other known identifiers, or synonyms, in the expanded search.With this process, we provide a comprehensive large-scale study report incorporating all available identifiers for each small molecule, its close analogs, and available bioactivity data and/or mechanism of action (MoA) if known.

Chemogenomic database selection and accessing
A common first step to evaluate a small molecule as a potential therapeutic, includes the exploration of chemical and biological space to identify known information and potential target.In contrast to most available tools that focus on one database for target-predictions [14,[25][26][27] we explored multiple chemogenomic databases as knowledge source.The solely criterion for database selection was the availability of REST API, a data request protocol for query and transfer.For chemical-and experimental-based information sources, we selected the well-established ChEMBL and PubChem databases.BindingDB was selected for protein-ligand evidence and, EMBL-EBI and PubMed as sources for comprehensive searches (Table 1).Although each bears unique data, we observed some overlap in data content, especially for literature evidence.However, it is important to note the indexing type and database specification prior to integration, as not every record is curated, and query fields could have different meaning.For example, experimental data in PubChem could be from biological assays (doseresponse, phenotypic, or target-based) or biochemical assay (profiling or binding), while BindingDB refers to binding affinities assays solely.
After database selection, we created a custom function to access and retrieve data from each source.Following the corresponding REST API web services, we constructed an html query link with the database domain + web service architecture + data of interest + parameters.Specifically, we used chembl_id_lookup, mechanism, document, compound_record, molecule/ synonym and similarity data functions from ChEMBL.For PubChem we used compound's data SMILES, name, CID, PubMedID, PatentID, fastsimilarity_2d, synonyms and MolecularWeight functions.PubMed records were found using the PMID and PMC xref from PubChem, and europepmc/search function from EMBL-EBI server.The command, GetTargetByCompound, was used to access BindingDB database.For exact URL query construction please refer to the database web-services [22-24, 28, 29].

Querying and standardization of chemogenomic databases using SMILES
We reasoned that, to investigate a set of query compounds, the first step was to explore hidden patterns among the various chemogenomic databases containing large datasets of small molecules and their bioactivity.However, these patterns are difficult to reveal when searching across databases, mainly due to the lack of indexing standardization and the existence of multiple equivalent representations of a chemical structure (SMILES [30]).For example, ethanol can be encoded with SMILES OCC, CCO and C(O)C.Furthermore, in ChEMBL the SMILES CCO corresponds to CHEMBL545 but C(O)C is not a valid query, and vice versa the notation C(O)C corresponds to CID702 in PubChem while CCO is not valid.Nevertheless, to test the hypothesis that using SMILES as identifier is sufficient for mapping the chemical space, we used the provided query SMILES as first input (Fig. S1A).To reveal the index identity of each query, we used the custom functions to cross-reference across databases using 100% similarity match, and further confirmed their exact identity by comparing Morgan fingerprints as described in our Methods.
Due to submission discrepancies across repositories, query identifiers were combined in a "synonym" (common names) column and filtered if they were (1) numerical with no indication of external source origin, (2) IUPAC name from an unreliable source, or (3) duplicated when converted to upper case and removal of special characters such as ":" or "-".The remaining synonyms were used to retrieve scientific evidence, patent evidence and additional useful information associated with the compound of interest.We reasoned that expanding the search to include these identifiers would provide more exhaustive research on the query compounds, improving the target-prediction steps when using the additional pieces of evidence.With this filtering, invalid or duplicated query records were removed and bioactivity data, naming synonyms, scholarly evidence, and chemical information across selected chemogenomic databases was integrated.

Chemical comparison through similarity calculations
In addition to mining with SMILES and common names, we expanded the search to include closely related analogs (Fig. S1B).We first used RDKIT v.2024.03.1 [31] to convert the query SMILES to a canonical form, generating a unique notation to be queried and compared with equal features.The standardized format (canonical) was used to identify analogs by transforming them to a binary representation called fingerprints (Morgan fingerprints), which allows chemical similarity calculations.These similarities were computed using the Tanimoto coefficient [32], T, which measures the degree of similarity between a query and target structure as follows: where A represents the query fingerprint and B the target fingerprint, N A and N B represent the number of "1-bit" in A and B respectively and N AB represents the "1-bit" shared in both.To ensure consistency, analogs were transformed to the same canonical form as the query and filtered if the T score was below 80% threshold (score ranging from 0 to 1 transformed to percentage).This conversion allowed us to further reduce discrepancies between SMILES conversion while identifying chemical analogs.We reasoned that by selecting close analogs with an 80% similarity, as opposed to the suggested 85% threshold by Patterson et al. study [33], we may be able to retain useful moieties related to antimalarial activity (such as functional groups), while maintaining a high degree of chemical similarity between the query and analog compound.Although a target may have been reported for a given compound, the query compound might be similar but not identical and our goal was to capture these associations.To identify scaffold families, we constructed a similarity network (including the query dataset and searched analogs) by assigning nodes to chemical entities and edges between entities with a Tanimoto coefficient above 80%.With this network we were able to identify query drug-targets from analogs with known mechanisms of action or gene target, elucidating relationships that could be obscured otherwise.

Drug target identification
In order to provide drug-targets clues, we acquired and incorporated several large datasets with annotated compound-target pairs and created a module to search these sets for compounds closely related to query molecules (Fig. S1C).We used the Novartis Chemogenetic Library [34] assembly consisting of 4185 compounds annotated with their primary mammalian gene target, an in-house, consolidated file of 163 validated antimalarials originating from phenotypic screens with their drug target, a collection of 218 licensed well-validated therapeutic drugs extracted from well-established databases such as IUPHAR/BPS pharmacology database [35] or the FDA approved drugs list [36], and a set of 157 known antibacterial inhibitors [37] from which seven (azithromycin, mupirocin, dapsone, sulfalene, sulfadiazine, triclosan, sulfamethoxazole) overlap with the antimalarial set.The assembled list can be accessed at https:// github.com/ winze ler-lab/ CACTI metadata folder.
To reduce biases when comparing large datasets with molecules of wide mass range, we implemented a method to partition the query and target dataset by molecular weight (default cutoff of 500 Da) and convert each into fingerprints for subsequent similarity calculations.We then implemented a function to allow selection of fingerprint conversion algorithm (RDKFingerprint or Get-MorganFingerprintAsBitVect) and selection of binary vector size, to prevent biased structure pattern identifications.Once fingerprints were calculated, we created a NxM matrix, where N and M is the number of fingerprints in the dataset and calculated Tanimoto coefficients for every pair, followed by clustering molecules based on their score (default 0.8).

CACTI, a large-scale chemical compound analysis tool
As a final step we aimed to integrate the chemical querying and chemical comparison approaches to construct a compound analysis pipeline that will systematically (1) identify index identifiers for a set of given compounds, (2) consolidate relevant information known for index identifiers and identify close chemical analogs, and (3) identify scaffold families to provide target hypothesis with collected index annotations, analogs and known drug-target information.(Fig. 1).The pipeline, CACTI, is accessible at https:// github.com/ winze ler-lab/ CACTI.Shortly, we built the customizable pipeline with python version 3.12 currently executable with UNIX, allowing for single or bulk queries and the selection of analyses and parameters to be performed.After submission validation of the query SMILES set (Fig. 1A), the first step

INTERNET
Server Database Fig. 1 CACTI drug target prediction workflow.Pipeline workflow steps are illustrated by modules where A represents the user SMILES set as input in a tabular or comma delimited, B refers to the database querying and retrieving steps using SMILES to identify synonyms, citation evidence, patents, membership in datasets and additional information for target prediction.Module 2 refers to C the identification of close analogs in chemogenomic databases to the query set, and D transformation from SMILES to binary fingerprint for scaffold comparison.In module 3, E clustering chemical entities based on similarity measures is performed and F target predictions are completed combining information from previous modules and validated known-target datasets in the pipeline was to use the chemical and biological space exploration method to obtain the multiple identifiers from which a single compound is known, including IUPAC and common name(s), as well as peer-reviewed publications, deposited datasets, and patents associated to each common name (Fig. 1B).To identify analogs in the public domain that might have a known target, we then retrieved the list of similar scaffolds to the ones of interest (Fig. 1C) with at least 80% similarity.This step allowed us to increase the likelihood of accurately uncovering drug-target pairs or providing prediction starting points.
To better understand the query molecules and relationships within the dataset and public domain using the acquired genomic and bioactivity annotations, the next step in the pipeline includes the chemical comparison methods.SMILES for the query set and analogs are transformed to binary fingerprints (Fig. 1D) to allow comparisons between them, and clustered together to find core scaffolds that are similar to at least 80% to each other (Fig. 1E).Although this step seems redundant after querying for similar structures in Fig. 1C, performing this extra validation was valuable to discover if core scaffolds from the analog subset is shared by two or more query compounds.Lastly, to predict a drug-target under the molecular similarity principal (Fig. 1F), scholar references, known mechanisms of action or gene targets, and bioactivity data associated to compound members within each cluster are compared.It is important to mention that in case of compound analog reacting noncovalently against a biological target yet considered active, typically referred as pan-assay interference compounds (PAINS), target hypotheses may be misguided for a query compound based solely on the bioactivity annotations.Thus, careful inspection may be needed when evaluating members from the cluster and additional annotations may need to be considered.

Results
To facilitate the exploration of the chemical and biological space for large datasets, and to provide a better understanding of chemical structures and their impact on a biological system, we constructed a drug-target prediction tool using data mining and chemoinformatic techniques.Our overall objective was to automate the tedious molecule-by-molecule searching that occurs when hits from a high throughput phenotypic screen are initially evaluated, and to perform this search in a comprehensive, unbiased fashion, looking for information not only on the query molecules but also on closely-related analogs.Because we noted that molecules might have different common names in the literature, we also sought to systematically uncover synonyms.For example, the new imidazolopiperazine antimalarial named ganaplacide that is in phase III clinical trials has been known as KAF156, and GNF156 and substantial work has been published on the closely related scaffold named GNF179 which differs from KAF156 by a single halogen substitution.Our motivation was thus to provide a tool that would allow the user to focus energy on phenotypic screening hits that might have novel mechanisms of action and thus avoid compounds with a well-established mechanism or known drug resistance liabilities.In addition, we wished to create a tool that could alert the user to relevant information from work on other species.If a compound is a dihydrofolate reductase inhibitor in humans, it is likely also a dihydrofolate reductase inhibitor in malaria parasites [38].
As described in the methods, this tool, which we have named CACTI consists of 3 modules, was coded into python and accepts queries in the forms of tabular or comma-delimited files containing the SMILES of query molecule(s), as well as additional field columns such as the standard name, if desired.The output consists of Excel files from each selected analysis (module) with a comprehensive report for all query molecules with synonyms and scholar evidence mined from the selected knowledge databases including ChEMBL, PubChem, BindingDB and PubMed.The output also includes a complete list of close analogs to query compounds and their similarity scores from the network construction module.Lastly, it includes a report from a clustering module in which query compounds are matched to similar molecules from the annotated target set.These last two reports can be exported to external tools for network visualizations, such as Cytoscape [39].CACTI can be accessed from the public GitHub repository and executed by creating a local copy and following instructions indicated in the repository.

Case study: assessment of drug-target querying using the Pathogen Box dataset
In order to assess the performance of the tool, we applied it to the Pathogen Box dataset [40] from the Medicines for Malaria Venture (Fig. 2A).This set is a collection of 400 diverse drug-like molecules, with an average molecular weight of 374.13 Da, active against several neglected tropical diseases.This set was physically assembled by the Medicines for Malaria Venture and shared with users working in the neglected disease space to provide druglike starting points for oral drug discovery.Accordingly, only 32 compounds failed to comply the standard Lipinski Rule of Five [41].The set includes 26 positive controls from various scaffold families, 24 of which are part of known drug-target pairs.Controls include well known drugs like the antimalarials doxycycline, primaquine and mefloquine [42], the antibacterial rifampicin [43], the antimicrobials pentamidine, nifurtimox and fluoxetine, the anti-schistosomiasis medicines, praziquantel and mebendazole.For these controls, we often found too much information, and these were thus excluded from further consideration.The remaining 374 compounds (Table S1) are ones that have reported activity in published phenotypic screens for neglected tropical disease, with the majority being antimalarial or antimycobacterial compounds [44][45][46] (Fig. 2B).Chemical properties for the 400 compounds and biological activity in multiple parasites and pharmacokinetic measurements, including cellular toxicity and pharmacokinetic measurements in HepG2 cells, can be found in ChEMBL-NTD Table S1 Table S2 Table S3 Novartis set

Module 1 identifies 4,315 synonyms for the Pathogen Box compounds
A good literature search can prevent months of wasted effort, but searching biological databases such as Pub-Med is hampered by the fact that compounds frequently change names in biological publications.We assessed if we could find synonyms for the compounds in an automated way.Searching using CACTI identified 4315 synonyms for compounds in the Pathogen Box, that were not included in the initial Pathogen Box description, with an average of 11 names per compound (Table S1, Fig. 2C, D).For example, we found 10 synonyms for compound MMV006741, including GNF-Pf-4484, May-bridge3_001644, and BRD-K92165166-001-01-3, among others.Likewise, the automated synonym search revealed that one trivial name for MMV676600 is danusertib, a known pan-aurora kinase inhibitor [47].Overall, we found 69 different names for MMV676600.These data also provided clues about the provenance of the compounds.The search revealed substantial overlap between the GSK TCAMS (Tres Cantos Anti-Malarial Set) library [48] (ChEMBL-NTD (https:// chembl.gitbo ok.io/ chemblntd), set 1) with 164 compounds present in both sets, 74 compounds overlapping with the St. Jude Biomedical library [49] (https:// chembl.gitbo ok.io/ chembl-ntd, set 3) and eight that were members of the Novartis GNF library [50] (https:// chembl.gitbo ok.io/ chembl-ntd, set 2).The automated search also revealed members from the Pathogen Box that are readily accessible to the scientific community.We found 206 compounds with suppliers' identification numbers for various commercial sources including AKOS, Zinc and MCULE.In general, module 1 was able to capture and identify diverse names from published studies and public libraries, as well as provide the unique identifiers for selected databases including PubChem (CID identifier) and ChEMBL (CHEMBL identifier).

Module 1 identifies new citations, datasets and patents for Pathogen Box molecules
No literature citations were provided with the initial Pathogen Box dataset.Therefore, we next used CACTI to comprehensively identify patent and literature citations using not just the MMV names but all synonyms (Fig. 1E).Unsurprisingly, searching with the MMV names often produced literature that described the Pathogen Box.However, searching with synonyms provided many more citations on the compounds.For example, MMV675997, is associated with 4 pieces of evidence that were deposited separately under several chemogenomic databases with different synonyms (MMV675997, CHEMBL1094051, BDBM50313769) (Table S1).Searching with these synonyms showed that this compound, which has a peptoid backbone and a nitrile P1′ warhead came from a library predicted to contain T. brucei Metacaspase enzyme inhibitor [51].Likewise, the kinase inhibitor, danusertib (MMV676600) is associated with a total of 1195 pieces of evidence with citations indicating a possible role in inhibition of Toxoplasma gondii calciumdependent protein kinase 1 CDPK1 (calcium-dependent protein kinase 1), as well as 100 pieces of patented evidence.From these, a CACTI PubMed search using PHA-739358, the additional synonym of MMV676600, yields three more literature references compared to MMV676600 (Table S1).
The literature search also showed that some of the compounds have activity in multiple species.For example, Murphy and colleagues [52] discovered that MMV676050 and MMV676182, both tested against cryptosporidiosis in the Pathogen Box activity profiling, are potent inhibitors targeting CDPK1 T. gondii with known crystal structure (PDB ID 3N51 [52]).Similarly, the literature search revealed that 162 members of the Pathogen Box showed evidence of a target validation in other models, such as cancer cell lines.For example, the chemotherapeutic milciclib (PHA-848125), currently in clinical trials, is a known dual cyclin-dependent and tropomyosin receptor kinase inhibitor [53,54].Overall, we found 30,425 new pieces of literature, 3356 patents and 2182 datasets associated with the Pathogen Box compounds.Though exploring the literature for different names may be straightforward for a small-scale study, as the study size increases, it becomes more difficult to capture these nuances.Therefore, the implemented method of acquiring and cross-referencing all known identifiers to retrieve data from the several chemogenomic databases provides a more detailed and complete report.

Module 2 identifies 12,383 new closely related compounds in the public domain that can be used for SAR by inventory
An important step for compound evaluation and testing is determining if there are closely related analogs available in the public domain that can be sourced for testing.Performing a Tanimoto similarity search for the 374 uncharacterized Pathogen Box members resulted in 12,383 close analogs (T = 90%) from the selected chemical knowledge databases (Table S2, Fig. 2D).In contrast, only eleven compounds (MMV393995, MMV676269, MMV676442, MMV021057, MMV202553, MMV688754, MMV099637, MMV676050, MMV676064, MMV676204 and MMV676398) out of the 374 had no obvious analogs in the public domain under this criterion.Most analogs identified belonged to PubChem (12,246 analogs), BindingDB (112), or ChEMBL (25) databases.On average, 34 closely related analogs were identified for each member of the query set with an average similarity of 94.44%.For example, we found 55 analogs of MMV019721, a recently discovered acetyl-coenzyme A synthetase Plasmodium inhibitor [55].Many of these analogs are commercially available.
Interestingly, we observed eight closely-related analogs that were identified twice when querying for the 374 Pathogen Box compounds (Table S2, Fig. S2).Finding a close analog for two different query compounds suggest the presence of similar core scaffolds and likely similar biological targets.For example, the compounds MMV595321 and MMV676477, share five different analogs (CID's 136636721, 136636828, 156278160, 156278182 and 156285860), most of which are part of an antiparasitic inhibitor optimization effort (patent WO-2021077102-A1).Another analog (CID 44526919), related to MMV023969 (TCMDC-134161) and MMV024035 (TCMDC-134227), inhibits P. falciparum growth by up to 97% (IC 50 < 2 µM) [56].Thus the related Pathogen Box compounds could be attractive drug starting points.

Module 3 for new target discovery hypotheses-guilt by association approaches
One of the most resource-consuming steps in drug discovery is target deconvolution.To create target predictions, we utilized the information collected from previous modules and clustered the 374 Pathogen Box compounds and 156 close analogs, with sets of 4,716 molecules with known mechanisms of action to create hypotheses about the function of Pathogen Box compounds.The "known target" set was derived from multiple sources, including 4,185 compound-target pairs assembled by Canham et al. [34], 150 known antibacterial drug-target pairs [37], and a collection set of 381 antimalarial compound-target pairs extracted from established databases such as IUPHAR [35] and PDB Ligands [57], in addition to validated target-ligand pairs originating from the TCAMS and GNF Novartis Malaria Box phenotypic screening libraries.Although the assembled library of known target pairs was enriched for antimalarial inhibitors, many have activity across the parasitic and bacterial diseases included in the study, such as atovaquone and doxycycline.It is worth noting that targets for 152 Pathogen Box compounds (Table S1) were already known and were already included in the known target set.Clustering the 374 and their 156 close analogs with 4,716 known targets resulted in 77 clusters using a similarity threshold of T = 80% (Table S3), with an average of 4 compounds per cluster, and 264 singletons (Fig. 3A).Out of the 77 clusters, twenty-five had two or more Pathogen Box molecules and 71 clusters contained one or more MOA/ target or putative target annotations.From the 71, 20 have at least one Pathogen Box compound with unknown or unclear mechanism.We did not find a MOA/target prediction for six clusters (Fig. 3B).
To assign targets for the Pathogen Box compounds, we inspected members from each cluster and assigned a potential target/function for the "unknown" Pathogen Box compound based on their counterpart target evidence.From this, we found targets appear repeatedly at rates higher than expected (hypergeometric mean function = 4.4 × 10 -3 − 2.6 × 10 -48 ) (Fig. 3C).For example, three clusters [55,56] contain predicted dihydroorotate dehydrogenase (DHODH) inhibitors and six contain predicted cytochrome b inhibitors, of which cluster 62 have one Pathogen Box compound (MMV687807) with no known reported target in Plasmodium or Mycobacterium.We also found 9 clusters whose members contain an analog or known target set compound from two or more different annotations.This includes cluster 338 where close analogs to MMV687700 are known salicyl-AMP inhibitor and other members from the known drug target set are known human adenosine receptor agonists.Despite this apparent incongruity, all compounds in cluster 338 contain the basic adenosine scaffold and are adenosine analogs.Another example includes cluster 177 where MMV688283 and MMV687246 (a PfCDPK5 inhibitor [58]) are clustered with TCMDC-138293, a known Plasmodium DHODH inhibitor.This may suggest some polypharmacology or potential weaknesses in the automated annotations.
We also found cases where the predicted target was assigned based on evidence from a different parasite genus to the one initially tested.The diazine scaffold family (cluster 30) has two compounds that have demonstrated activity against parasite and mammalian methionine aminopeptidase-1b, an enzyme predicted to catalyze the removal of the N-terminal initiator methionine during protein synthesis in parasites and mammals (MMV084997/GNF-Pf-359).The group also contains the uncharacterized, anti-kinetoplastid Pathogen Box compounds MMV658993 and MMV658988.This association creates a hypothesis that MMV658993 and MMV658988 may target methionine aminopeptidase-1b in kinetoplasts.

Discussion
Neglected diseases (parasitic diseases, and to some extent, rare diseases), attract less commercial interest and significantly less overall funding than other diseases, such as cancer or diabetes.On the other hand, much of the small molecule data for well-funded diseases such as cancer lies in the private, well-curated databases that are maintained by pharmaceutical companies.These databases are often one of a company's biggest intellectual property assets.In contrast, for neglected disease, a larger proportion of the drug discovery data will be generated by academic researchers or public-private partnerships where much of it will ultimately be placed in the public domain.The availability of well curated public screening datasets (e.g.such as the Pathogen Box dataset) creates both opportunities to make new discoveries (e.g. for drug repurposing and target discovery) as well as challenges that relate to the dispersed and disorganized nature of the data.To make the task of assessing diverse data more accessible to neglected disease drug discovery researchers, we constructed a tool to pragmatically assess the major chemical databases and identify all data available for a query compound, as well as similarities between small molecule inhibitors in any system.With this approach a comprehensive report is generated and may be used to redirect laboratory resources in a more focused way, in addition to reducing the amount of work needed for drug-target or chemical optimization efforts.
Our approach has its limitations.First, our algorithm makes no assessment of literature quality.A manuscript reporting that a given compound binds a target may not rigorously assess the quality or strength of the binding.This may happen when a researcher develops a biochemical assay for a target and then tests a library, such as the Pathogen Box library, against this target: In this case the best compounds from this exercise may be reported as potential inhibitors of the target in a publication.Our computational algorithms cannot catch these nuances and are out of scope; although a thorough report on a particular query set will be obtained, a human review is needed to revise and confirm the veracity/ strength of acquired evidence.Perhaps not surprisingly, the data-querying strategy across multiple chemogenomic repositories is especially helpful when querying for small molecules that have been evaluated for activity against multiple disease indications.For example, compounds advancing to clinical-trials or those having a validated drug-target, are query entries that when searched provide the most complete information.Another limitation is that our approach somewhat relies on the assumption that a compound will have the same target in Trypanosoma cruzi as in Toxoplasma gondii, despite the two species belonging to different eukaryotic phyla.Nevertheless, previous studies have highlighted that drug-target pairs tend to be conserved across species.A compound like methotrexate likely acts against dihydrofolate reductase in human and in malaria parasites.Cladosporin targets lysyl-tRNA synthetase in both yeast and malaria parasites [59].There are hundreds of other examples of target conservation across species.Species selectivity thus comes from natural or engineered specificity, as well as innate ability of different species to detoxify compounds.On the other hand, just because a compound targets the electron transport chain in malaria parasites doesn't mean that it targets the electron transport chain in M. tuberculosis (e.g.such as in ELQ-300, with unreported target in tuberculosis) and predictions and caution is needed.As a future improvement to the pipeline, similar to gene ontology strategies (GO terms), querying and comparison of medical subject headings (MeSH terms) associated to compounds of interest (and analogs) can be implemented to mitigate the need to rely on similar structure function assumptions.This could increase the confidence of CACTI target identification hypotheses.
Another concern is that we relied on chemical scaffold clustering: This chemical comparison depends on the chemical fragment partition that is initially selected, resulting in potential arbitrary cluster assignment.In addition, clustering is less suitable for natural products, which tend to have more complex structures than smaller molecules.Increased size means partition methods favor fragment partitions that break down molecules and ignore the R groups that may be vital for the pharmacological effect of the natural product.Thus, although the provided scholarly report will be helpful, the clustering approach as scripted is less suitable for their evaluation.Nevertheless, we found that close analogs with similar drug-target matches did tend to group together even if the larger group often split into more than one cluster.We are confident that the established prediction method provides an initial hint for further target validation efforts.
A limitation with this approach includes the predictive component of the analysis.Though it is accepted that similar molecules tend to have similar mechanisms of action or drug-targets, experimental validation is needed to confirm binding to the desired target.The assumption that one small molecule will have just one target is more fraught as compounds get larger and where two different pharmacophores (e.g. an ATP-like molecule and a naphthalene) may be incorporated into one small molecule.In addition, for promiscuous inhibitors, like kinase inhibitors, our approach may provide limited resolution.
Another concern is that one component of this tool relies completely on network connectivity and database server availability (e.g.open query-request for selected databases) to query and retrieve literature and analogs from the public domain.To partially address this need, a future avenue includes the use of a periodic "database dump" that can be locally accessed and queried in lieu of internet access, when connectivity is unreliable or inaccessible.
Overall, we believe the identification of literature evidence as well as close analogs obtained through CACTI will aid researchers in early stages of drug discovery pipeline, and consequently allow more work on structure-activity relationship analysis.Similarly, the target prediction module implemented in this tool will serve as starting points for subsequent drug-target validation efforts and support the translation of genomic data into effective new drugs through the comparison of scaffold families.This automated tool shows a promising approach to quickly investigate multiple chemical entities in a single query, and prioritize hits for further exploration, especially in academic settings where compound management and resupply is limited.

Fig. 2
Fig. 2 Identification of synonyms, references, and close analogs for 374 Pathogen Box compounds.A Pathogen Box querying workflow.Results obtained from querying 374 Pathogen Box compounds against the pipeline.Supplementary tables related to each analysis are indicated along with the results from each module.B Membership of selected compounds.Number of compounds active against each neglected disease according to the Pathogen Box reference biological data [40].C Synonyms identified.Number of names given for each Pathogen Box compound across different studies.Well established drugs belonging to the Pathogen Box are shown along with the common drug name.A dotted line denotes the average (n = 11) of synonym names per compound.D Statistics for identified information per compound.Identified categories for the 374 compounds are shown in the X axis.Left Y axis shows the number of Pathogen Box compounds found per category and the right Y axis shows the number of pieces of evidence per category.E Pieces of literature found per compound.Scatter plot showing the total pieces of evidence for each compound (X axis) against the frequency (Y axis).Known inhibitors with their inhibitory protein/enzyme are shown

Fig. 3
Fig.3 Chemical clustering and target prediction.A Scaffold clustering.Structural similarity calculations with Pathogen Box library and known drug targets, as determined with RDKIT fingerprints.Clusters were assigned using 80% Tanimoto similarity threshold.Visualization of chemical space was performed using scikit-learn t-SNE function.B Scaffold clustering statistics.Total number of Pathogen Box members (light blue) and clusters (light purple) identified for the different categories shown in X axis.Total of known drug-target annotations, and predicted drug-targets, among all clusters are included in the summary categories.C Overrepresented clusters with known target.Examples of clusters with drug-targets found at rates greater than expected according to the hypergeometric mean function.Structures for cluster members are included

Table 1
Approximate number of records by entity type for selected chemogenomic and literature databases