Automated Framework for Developing Predictive Machine Learning Models for Data-Driven Drug Discovery

The increasing availability of extensive collections of chemical compounds associated with experimental data provides an opportunity to build predictive Quantitative Structure-Activity Relationship (QSAR) models using machine learning (ML) algorithms. These models can support data-driven decisions and have the potential to speed up the drug discovery process and reduce its failure rates. However, many essential aspects of data preparation and modeling are not available in any standalone software. Here, we developed an automated framework for the curation of chemogenomics data and for the development of QSAR models for virtual screening, using the open-source KNIME software. The workflow includes four modules: (i) dataset preparation and curation; (ii) chemical space analysis and structure-activity relationships (SAR) rules; (iii) modeling; and (iv) virtual screening (VS). As case studies, we applied these workflows to four datasets associated with different endpoints. The implemented protocol can efficiently curate chemical and biological data in public databases and generate robust QSAR models. We provide scientists with a simple and guided cheminformatics workbench, following the best practices widely accepted by the community, which they can adapt to solve their research problems. The workflows are freely available for download at the LabMol GitHub and in Supporting File S1.

Introduction

We used four datasets (SmTGR, Pf3D7, hERG, and Ames mutagenicity37) as case studies to develop an integrated pipeline for data preparation and curation for drug discovery and toxicity. An overview of the datasets is shown in Table 1.

Data files gathered from the ChEMBL and PubChem databases have their own particularities regarding column names and file extensions when exported from the database. Therefore, considering these differences among the raw data that will serve as input for the user, two separate workflows were developed to treat bioassay data from PubChem Bioassay (Figure 1a) and ChEMBL (Figure 1b). In the PubChem Bioassay workflow, bioactivities (loaded in the "CSV Reader" node) and chemical structures (loaded in the "SDF Reader" node) are merged using compound IDs. In the ChEMBL workflow, a CSV file containing the Simplified Molecular Input Line Entry Specification (SMILES) notations of the molecules is read using the "CSV Reader" node. The input parameters of each workflow are configured separately: the user must double-click on each node to open the configuration window, load the dataset saved in SDF or CSV format, then click the "OK" button.

Abbreviations (Table 1): Pf3D7, Plasmodium falciparum 3D7 strain; SmTGR, Schistosoma mansoni thioredoxin glutathione reductase; Ames, mutagenicity; hERG, human ether-a-go-go-related gene.

Figure 1. Workflows for preparing PubChem Bioassay (a) and ChEMBL (b) data. These two workflows were prepared to deal with the particularities of the data gathered from the PubChem Bioassay and ChEMBL databases.

Data preparation

Next, the input data are passed through the "Data preparation" metanode to normalize or transform different measures of binding affinity. For example, bioactivities (IC50 or EC50) on the mass scale (e.g., µg mL⁻¹) were transformed to the molar scale (e.g., µM), then normalized to negative logarithm (−log) units (i.e., pEC50 and pIC50). Fundamentally, dose-dependent inhibition is a logarithmic phenomenon, so it makes sense to work in −log units. Subsequently, the "Bioactivity cutoff" metanode is used to set a threshold value that differentiates active/toxic compounds from inactive/non-toxic compounds. The bioactivity cutoffs were selected according to hit and lead criteria in drug design.38 Details on the threshold values used in each dataset are shown in Table 1. The selected thresholds are based on the data distribution and on literature for each particular endpoint. It is worth noting that this step is data-dependent, and the user must perform all transformations according to their own data.
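The workflow performs these conversions inside the "Data preparation" metanode; for readers who want to reproduce the arithmetic outside KNIME, the following minimal Python sketch shows the same mass-to-molar and −log transformations (the function name and unit labels are our own illustrative choices):

```python
import math

def to_pic50(value, unit, mol_weight=None):
    """Convert an IC50/EC50 measurement to pIC50 (-log10 of the molar concentration).

    value      : measured potency
    unit       : "uM" for micromolar, "ug/mL" for mass concentration
    mol_weight : molecular weight in g/mol, required for mass-scale input
    """
    if unit == "ug/mL":
        if mol_weight is None:
            raise ValueError("Molecular weight is required for mass-scale data")
        # ug/mL -> umol/L: (ug/mL) / (g/mol) * 1000 = uM
        value = value / mol_weight * 1000.0
    elif unit != "uM":
        raise ValueError(f"Unsupported unit: {unit}")
    # uM -> M is a factor of 1e-6, so pIC50 = 6 - log10(IC50 in uM)
    return 6.0 - math.log10(value)

# Example: IC50 of 0.5 ug/mL for a 350 g/mol compound -> ~1.43 uM -> pIC50 ~ 5.85
print(round(to_pic50(0.5, "ug/mL", mol_weight=350.0), 2))
```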

Structure standardization
Very often, public datasets contain chemicals represented in different formats, either because of the experimental protocols under which they were evaluated or because of different protocols for drawing/storing chemicals. To solve these problems, the "Structure standardization" metanode is employed to standardize and clean all chemical structures according to the protocols developed by Fourches and colleagues.18-20 Explicit hydrogens are added, whereas polymers, salts, metals, organometallic compounds, and mixtures are removed, following the best practices in data curation, since most descriptor-generating software does not properly process these structures, generating errors in descriptor and fingerprint calculation. In addition, specific chemotypes such as aromatic rings and nitro groups are normalized, and valences are validated or corrected. This allows for the detection of the most common errors in chemical structures, such as abbreviations of functional groups (e.g., Phe as phenyl) and incorrectly assigned valences or aromaticity. Finally, International Chemical Identifier Keys (InChIKey) are generated for each entry in the dataset. The InChIKey is an efficient method for detecting duplicates, once molecules have been adequately standardized.
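The paper's standardization is implemented with Indigo-based KNIME nodes; as an illustration only, a roughly equivalent pass can be sketched with RDKit (the exact normalization rules applied by the workflow may differ):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    """Standardize one structure and return (canonical SMILES, InChIKey),
    or None for records that cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                              # unparsable record: flag for manual review
    mol = rdMolStandardize.Cleanup(mol)          # normalize nitro groups, fix valences, etc.
    mol = rdMolStandardize.FragmentParent(mol)   # strip salts/solvents, keep the parent fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges where possible
    return Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)

# Aspirin drawn as a sodium-salt mixture collapses to the parent acid
print(standardize("CC(=O)Oc1ccccc1C(=O)O.[Na]"))
```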

Analysis of duplicates
Datasets ready for modeling must contain unique compounds that are structurally different from all other compounds in the dataset. However, the same compound may be present many times in the same dataset. If modelers build models using datasets containing these structural duplicates in both modeling and external sets, the predictivity of these models will be overestimated.18-20 Therefore, duplicates must be identified and removed prior to any modeling study. Here, InChIKey notations are used to automatically identify duplicate entries in the "Analysis of duplicates" metanode. Once duplicates are identified, an analysis of their bioactivities is performed using the "Concordance" metanode. In this step, the intra- and inter-laboratory activity/toxicity concordance between duplicate records is investigated to ensure the consistency and quality of the datasets. Lastly, duplicates are removed as shown in Figure 2, and the resulting dataset sizes are reported in Table 1.
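The exact removal rules are given in Figure 2, which is not reproduced here; a common convention, assumed in the sketch below, is to keep one record per InChIKey when replicate measurements agree and to discard compounds whose replicates disagree:

```python
import pandas as pd

# Toy dataset: each record carries a standardized InChIKey and a binary outcome.
df = pd.DataFrame({
    "inchikey": ["AAA", "AAA", "BBB", "CCC", "CCC"],
    "outcome":  ["active", "active", "inactive", "active", "inactive"],
})

def deduplicate(frame):
    """Keep one record per InChIKey when replicate outcomes agree;
    discard all records whose replicates disagree (discordant data)."""
    concordant = frame.groupby("inchikey")["outcome"].transform("nunique") == 1
    return frame[concordant].drop_duplicates(subset="inchikey")

print(deduplicate(df))   # AAA kept once, BBB kept, CCC removed (discordant)
```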

Dataset balancing

Data balancing leads to a loss of potentially important data and might reduce chemical coverage, so one should always try to develop models without balancing the data. However, this is not always possible. Usually, in HTS data, the number of active compounds is much smaller than the number of inactive compounds.39 In this case, binary ML models built from imbalanced datasets may be biased toward the prediction of the majority class and poorly predictive for the minority class.39 Other approaches for dealing with unbalanced datasets are discussed elsewhere.39 Considering the unbalanced characteristics of the studied datasets (see details in Table 1), an under-sampling approach (i.e., reducing the size of the majority class) is applied through the "Dataset balancing" metanode.3 Datasets used in continuous ML models must be saved without performing the balancing step.

The under-sampling strategy used here retains most of the representative structures of the majority class, thus ensuring a structural diversity that is most representative of the original chemical space.3 Initially, the Euclidean distances between each compound in the majority class and those present in the minority class are measured using the k-Nearest Neighbors (k-NN) algorithm. Then, representative molecules with high, medium, or low chemical similarity in the majority class are selected using k-distances and extracted to generate balanced datasets.40 The user must check the number of compounds in the minority class to determine how many compounds need to be linearly selected from the majority class; this number must be entered in the "Absolute" field of the "Row sampling" node configuration. Finally, each balanced dataset may be saved with the "SDF Writer" node in a user-defined directory. Details of the number of compounds in each balanced dataset are shown in Table 1.
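As an illustration of this selection logic (not the Enalos implementation itself), the sketch below measures each majority-class compound's Euclidean distance to its nearest minority-class neighbor and then picks evenly spaced compounds along the sorted distances:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def undersample_majority(X_major, X_minor, n_keep):
    """Select n_keep majority-class rows spread evenly across their
    distance-to-minority-class spectrum (high/medium/low similarity)."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_minor)
    dist, _ = nn.kneighbors(X_major)      # Euclidean distance of each majority compound
    order = np.argsort(dist.ravel())      # to its nearest minority neighbor, then sorted
    picks = np.linspace(0, len(order) - 1, n_keep).astype(int)  # evenly spaced ("linear") selection
    return order[picks]                   # row indices into X_major

X_major = np.random.RandomState(0).rand(1000, 16)  # e.g., descriptors of 1,000 inactives
X_minor = np.random.RandomState(1).rand(120, 16)   # e.g., descriptors of 120 actives
print(len(undersample_majority(X_major, X_minor, n_keep=120)), "majority compounds kept")
```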

Automated model building
The developed workflow (Figure 3) aims to simplify and automate the model-building protocol according to the best practices for building predictive models.2,21 The details of each step of the modeling procedure are covered in the following sections.

Input data
To run the workflow, the "SDF Reader" node must be loaded with a curated dataset in SDF format. This node was configured to expect a column containing chemical structures and the associated biological data (binary or continuous). Datasets with binary data must have the experimental result column labeled "outcome", and datasets with continuous data must have the pIC50 data labeled "pIC50" (Figure 3).

Figure 3. General workflow for automated QSAR modeling of (a) categorical and (b) continuous data.

Molecular fingerprints

Molecular descriptors are the result of mathematical procedures that transform chemical fragments of a molecule into relevant numerical data. The descriptors can be used to perform statistical/machine learning analysis of the chemicals and to establish relationships between a chemical structure and its bioactivity or another property of interest.41 Fingerprints are a type of descriptor encoded as bit strings, in which each bit encodes the absence (0) or presence (1) of a fragment or atom in a chemical structure. The current workflow automatically calculates different fingerprint types, including substructure-based fingerprints (Avalon42) and circular fingerprints (Morgan43 and FeatMorgan43). Within the "Fingerprints" metanode, the user must click on the "RDKit fingerprints" node to select the desired fingerprint type. In addition, circular fingerprints may be adjusted according to bond radius and number of bits, whereas path-based fingerprints may be adjusted by number of bits and path length. In this work, all fingerprints were generated for the chosen datasets using a radius of 2 (for the circular fingerprints) and a bit vector of 2,048 bits (for the circular and path-based fingerprints).

Highly correlated descriptors are linearly dependent and have a similar effect on the dependent variable.2 Some algorithms are more prone to bias if correlated variables are used. Although ensemble trees are less prone to this bias, removing correlated variables can make model building faster and facilitate interpretation. After calculating the fingerprint descriptors, the workflow calculates the correlation between all descriptors and removes one descriptor from each highly correlated pair.
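To make the fingerprint and correlation-filtering steps concrete, here is a small RDKit/NumPy sketch using the same settings quoted above (radius 2, 2,048 bits); the 0.95 correlation cutoff is our assumption, since the workflow's exact threshold is not stated:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Morgan (circular) fingerprints with radius 2 and a 2,048-bit vector, as in the paper.
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048))
                for m in mols], dtype=float)

def drop_correlated(X, threshold=0.95):
    """Remove one column from every pair of bits whose absolute Pearson
    correlation exceeds the threshold (the cutoff value is an assumption)."""
    X = X[:, X.std(axis=0) > 0]              # constant bits carry no information
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)               # look at each pair only once
    keep = [j for j in range(X.shape[1]) if not (upper[:, j] > threshold).any()]
    return X[:, keep]

print(drop_correlated(fps).shape)
```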

5-fold external cross-validation (5FECV)
After the molecular fingerprint calculations, the dataset is split into five subsets of equal size using the "X-Partitioner" node. Four of these subsets form the training set (80% of the full set), while the remaining subset (20% of all compounds) serves as the test set. This procedure is repeated five times, allowing each of the five subsets to be used once as a test set. Models are built using the training set, while the compounds in the test set (fold) are employed to evaluate the predictive performance. At the end of each iteration, the "X-Aggregator" node collects the prediction results. All nodes between these two nodes are executed once for each repetition.21
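The same folding scheme can be expressed with scikit-learn; this sketch (with placeholder data and an RF classifier standing in for the KNIME learner nodes) mirrors the X-Partitioner/X-Aggregator loop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X = np.random.RandomState(0).rand(200, 64)        # placeholder fingerprint matrix
y = np.random.RandomState(1).randint(0, 2, 200)   # placeholder binary outcomes

# Five folds: each compound serves exactly once as an external test point.
preds = np.empty_like(y)
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=7).split(X, y):
    model = RandomForestClassifier(n_estimators=500, random_state=7)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])   # external predictions only

print("external accuracy:", (preds == y).mean())
```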

Applicability domain

One of the most important problems in QSAR modeling is establishing the applicability domain (AD). The AD must be determined for the chemical space of a predictive model in order to localize "reliable" and "unreliable" regions for prediction.44 Users can trust a model's predictions when there is evidence that the chemical space used for training matches the chemical space of the compounds not previously seen by the model. We used the "Applicability Domain" metanode to estimate the AD of the developed models (Figure 3). Within this metanode, the "Domain-Similarity" node utilizes Euclidean distances to define the chemical similarity between all training compounds and each compound in the test set. A prediction may be unreliable if the distance of a compound (not present in the training set) to its nearest neighbor in the training set is higher than a threshold controlled by an arbitrary parameter (Z = 0.5) that sets the significance level.45

To validate the workflow, we built categorical models using datasets previously studied by our group (i.e., SmTGR,3 Pf3D7,6 Ames mutagenicity, and hERG46), as well as continuous models using the Pf3D7 dataset. For this purpose, KNIME contains nodes for the most popular ML algorithms, such as RF13 and SVM,14 and the user can decide which algorithm to use. As an example, we used only the RF algorithm to build models for our case studies. Categorical (Figure 3a) and continuous (Figure 3b) ML models can be generated with any of the mentioned algorithms using the "Weka" or "Analytics" nodes, respectively. Subsequently, the optimization parameters of each algorithm may be adjusted by the user in the corresponding node. For example, in the "RandomForest (3.7)" node, the number of trees and the maximum depth may be adjusted to increase model predictivity and avoid overfitting. After completion of the model-building step, the output of the developed model is saved in the user-defined working directory using the "Weka Classifier Writer (3.7)" node (categorical models) or the "Model Writer" node (continuous models).
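The distance-based AD criterion described at the start of this section reduces to a nearest-neighbor computation. A minimal sketch follows, with the caveat that the exact formula inside the Enalos "Domain-Similarity" node may differ; the mean-plus-Z-standard-deviations cutoff below is an assumption:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ad_threshold(X_train, Z=0.5):
    """Distance cutoff = mean + Z*std of nearest-neighbor distances
    inside the training set (Z = 0.5 as in the paper)."""
    nn = NearestNeighbors(n_neighbors=2).fit(X_train)
    d, _ = nn.kneighbors(X_train)        # column 0 is the point itself (distance 0)
    d = d[:, 1]                          # distance to the true nearest neighbor
    return d.mean() + Z * d.std()

def in_domain(X_train, X_query, Z=0.5):
    """A query compound is inside the AD if its distance to the closest
    training compound does not exceed the cutoff."""
    cutoff = ad_threshold(X_train, Z)
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    d, _ = nn.kneighbors(X_query)
    return d.ravel() <= cutoff

X_train = np.random.RandomState(0).rand(300, 32)
X_query = np.random.RandomState(1).rand(10, 32)
print(in_domain(X_train, X_query))
```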

Performance of ML models
The external predictivity resulting from the 5FECV procedure can be adequately assessed with the "Statistics" metanode. During this step, categorical models are evaluated using the correct classification rate (CCR), sensitivity (SE), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV), whereas continuous models are evaluated using the correlation coefficient (R²), the root mean square error of cross-validation (RMSECV), and the predictive squared correlation coefficient for the test set (Q²). After this step, users can easily access the statistical results of the built models by right-clicking on the metanode and selecting the option "Connected to: Filtered table". The statistical characteristics of the models developed in this study are summarized in Table 2. As one can see, all the models were robust and predictive, with CCR values for the external sets in the range of 0.77-0.87. The binary SmTGR models showed predictive power similar to that obtained in our previous ML study.3 For the Ames mutagenicity data, all individual models presented CCR values 0.01-0.02 higher and SP values 0.11-0.13 higher than the consensus model developed by Alves and coworkers.47 For the continuous Pf3D7 models, the combination of fingerprints with Random Forest led to predictive models (Table 2) with R² values ranging between 0.72-0.73 and a Q² of 0.73. The remaining models were not compared with published models, since they were developed using unexplored or older datasets. Hence, the implemented protocol efficiently generates robust models with reliable predictive performance.
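For reference, the categorical metrics reduce to simple confusion-matrix arithmetic; in the sketch below, CCR is computed as balanced accuracy, the usual definition in this context:

```python
def binary_stats(tp, tn, fp, fn):
    """Classification metrics reported by the Statistics metanode."""
    se  = tp / (tp + fn)          # sensitivity (true positive rate)
    sp  = tn / (tn + fp)          # specificity (true negative rate)
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    ccr = (se + sp) / 2           # correct classification rate (balanced accuracy)
    return {"CCR": ccr, "SE": se, "SP": sp, "PPV": ppv, "NPV": npv}

print(binary_stats(tp=80, tn=75, fp=25, fn=20))
# -> CCR = (0.80 + 0.75) / 2 = 0.775
```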

Virtual screening
VS is a computational procedure used to filter large chemical libraries (i.e., 10⁵ to 10⁸ compounds) in order to prioritize a smaller number of compounds that will then be tested experimentally (i.e., 10¹ to 10³ compounds).48 Although VS was introduced 30 years ago, many molecular databases still contain errors and molecules with undesired physicochemical properties. To address this issue, we developed a comprehensive workflow (Figure 4) for the VS of potentially active and non-toxic compounds, with a practical application on the ChemBridge EXPRESS-Pick Collection.49

Input data
To execute the workflow, the "SDF Reader" node must be loaded with the EXPRESS-Pick Collection (504,599 compounds)49 or any other library in SDF format. This node was configured to provide a column containing chemical structures and the associated physicochemical properties.

Data curation
The first step of the VS module contains a set of procedures to guarantee that structures are well represented and standardized, using the same protocol employed in generating the models. In this step, structures are standardized, and problematic molecules and duplicates are removed from the library. Subsequently, the 2D chemical structures are stored in a KNIME table and tagged with a ChemBridge identifier and experimental physicochemical properties. Several rules based on molecular property distributions were developed to characterize specific subsets of chemical libraries, such as lead-like molecules (molecular weight ≤ 460, -4 ≤ LogP ≤ 4.5, LogSw ≥ -5, ≤ 5 H-bond donors, ≤ 9 H-bond acceptors, ≤ 9 rotatable bonds, and ≤ 4 rings).50 Finally, we adopted the "Aggregator advisor" metanode as a filter to identify molecules that are known to aggregate or that may aggregate in prospective experimental assays. The criteria used to predict a high probability of aggregation are LogP > 3.0 and a Tanimoto coefficient ≥ 85% to the closest known experimental aggregator.51 After these steps, 244,194 compounds were excluded.

Figure 4. General workflows for the VS of new SmTGR (a) and Pf3D7 (b) hits.
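The lead-like rules quoted above translate directly into RDKit property calls. The sketch below omits the LogSw (aqueous solubility) criterion, since RDKit ships no built-in LogS estimator, and the computed LogP is Crippen's estimate rather than the experimental values tagged in the ChemBridge file:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def is_lead_like(mol):
    """Lead-like rules quoted in the text (LogSw criterion omitted;
    see the note above)."""
    return (Descriptors.MolWt(mol) <= 460
            and -4 <= Crippen.MolLogP(mol) <= 4.5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 9
            and Descriptors.NumRotatableBonds(mol) <= 9
            and Descriptors.RingCount(mol) <= 4)

# Aspirin passes; stearic acid fails on LogP.
library = [Chem.MolFromSmiles(s) for s in
           ["CC(=O)Oc1ccccc1C(=O)O", "CCCCCCCCCCCCCCCCCC(=O)O"]]
print([is_lead_like(m) for m in library])
```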

ML filtering
To perform the ML predictions, the "Model Reader" nodes must be loaded with the output files of the most predictive models, while the fingerprints must be defined in the "RDKit fingerprints" nodes using the same parameters used for modeling. In parallel, the "SDF Reader" nodes must be loaded with the corresponding curated datasets in SDF format to estimate the applicability domains. Finally, the most promising hit compounds appearing at the top of the VS list can be exported through the "SDF Writer" and "Excel Writer (XLS)" nodes. The user can set their own hit criteria by configuring the "Selection" metanode after the ML predictions.

After data curation, we performed two independent VS campaigns on the compounds from the ChemBridge EXPRESS-Pick Collection using the developed models: the first targeting SmTGR (Figure 4a) and the second targeting Pf3D7 (Figure 4b).

Conclusions

In this work, we developed an automated computational workflow for building robust and predictive QSAR models employing ML algorithms, following the best practices for model development and validation. The workflow has four modules: data curation, modeling, chemical space analysis (Supporting Information), and VS. Moreover, we employed these workflows to curate, analyze, and model four datasets previously used by our group, as a benchmark. This workflow provides scientists with a simple and guided open-source cheminformatics platform for the development of QSAR models. This framework could be used as a validated starting point for beginner cheminformatics scientists to implement modeling in their laboratory routines, following the best practices for building predictive models approved by the community. The workflows are freely available for download at the LabMol GitHub52 and in Supporting File S1.

General protocol
Initially, compounds with bioactivity data are retrieved from the ChEMBL9 and PubChem10 databases and imported separately into KNIME 3.6.0.22,23 Subsequently, the potency values are converted to −log units, and an activity threshold is defined for the discrimination between active and inactive compounds. All chemical structures and corresponding biological properties are carefully standardized and curated using Indigo nodes57 and according to the protocols proposed by Fourches and colleagues.18-20 Then, the curated datasets are balanced using a linear under-sampling approach3 implemented in Enalos nodes.55 After dataset balancing, molecular fingerprints are calculated and models are developed using the RF algorithm13 implemented in Weka nodes.54 Models are validated using the 5FECV procedure with X-Partitioner nodes. The AD is calculated using the Domain-Similarity node implemented by Enalos.55 The ML models are fully compliant with the best practices for predictive modeling16,20 and with the Organization for Economic Co-operation and Development (OECD) recommendations, namely (i) a defined endpoint, (ii) an unambiguous algorithm, (iii) a defined domain of applicability, (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) a mechanistic interpretation, if possible.59 Finally, the ML models were saved using the "Weka Classifier Writer (3.7)" node or the "Model Writer" node and implemented as filters in the VS workflow to prioritize new compounds for further testing in experimental assay platforms.

Supporting Information

Supplementary information (a detailed description of the structure-activity relationships workflow), the source code of the KNIME workflows (File S1, ZIP format), and the complete lists of SmTGR and Pf3D7 virtual hits (Files S2 and S3, XLSX format, respectively) are available online.

Acknowledgments

The authors thank the "Rising Talents" program for the awards and fellowships received, which partially funded this work. We are thankful to Tesia Bobrowski for her kind help with editing the language of the manuscript.