Prediction of repurposed drugs for Coronaviruses using artificial intelligence and machine learning

Introduction
The 21st century has experienced three novel coronavirus (CoV) pandemics caused by the Severe Acute Respiratory Syndrome Virus (SARS), Middle East Respiratory Syndrome Virus (MERS), and Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The first SARS epidemic, from November 2002 to July 2003, led to around 8,000 reported cases, including about 700 deaths worldwide (https://www.who.int/csr/sars/country/table2004_04_21/en/). About ten years later, in June 2012, a second global CoV outbreak, MERS, began and continued until 2016, resulting in around 1,700 confirmed cases, including about 620 deaths globally [1]. The third and ongoing SARS-CoV-2 pandemic, officially declared by WHO in January 2020, has led to around 100 million global cases, including around 3 million deaths as of April 2021.
Presently, the ongoing SARS-CoV-2 global pandemic creates an urgent need for antiviral therapeutics to control its spread. The lack of effective therapeutics to date necessitates the development of predictive computational tools that can speed up and support the existing/ongoing experimental approaches for drug repurposing. Virtual screening based on molecular docking and molecular dynamics simulations has already been explored to identify antiviral compounds against SARS-CoV-2 [25,26]. Repurposed drug identification using machine learning techniques (MLTs) is less explored in CoV drug discovery to date. MLT-based predictive algorithms have previously been employed in the development of various antiviral predictors, viz., AVPpred [27], AVP-IC50Pred [28], HIVprotI [29], anti-flavi [30], and anti-nipah. In addition, our group recently developed 'CoronaVR', a comprehensive platform for the analysis and identification of CoV epitopes [31]. The input anti-CoV data in the current study was taken from 'DrugRepV', our recently published comprehensive database of experimentally validated repurposed drugs [32]. In the current study, we have identified repurposed drug candidates (against SARS-CoV-2, SARS, and MERS) using different MLTs like Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbour (KNN), Artificial Neural Network (ANN), and Deep Learning [Deep Neural network (DNN), Artificial Intelligence]. Further, we also predict the effective anti-Corona compounds after scanning the DrugBank repository through the developed predictive models.

Results
The robust prediction models were developed using various MLTs like SVM, RF, KNN, ANN, and DNN. The performance on the training/testing and independent validation datasets was assessed using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (R²), and Pearson's Correlation Coefficient (PCC or R). Chemical analysis was also performed on the anti-CoV (SARS, MERS, and SARS-CoV-2) compounds. Further, drug repurposing was done by scanning DrugBank with the developed machine learning models.

Quantitative structure-activity relationship model development
For SARS, various prediction models were developed using MLTs like SVM, RF, KNN, and ANN. The performance on the training/testing dataset with 198 entries was calculated using 10-fold cross-validation (Table 1). The prediction models developed using the training/testing dataset achieved PCCs of 0.92, 0.76, 0.76, and 0.73 from SVM, RF, KNN, and ANN, respectively. In contrast, the 23 entries of the independent validation dataset give an accuracy  Table S2). Prediction models were also developed for MERS using 10-fold cross-validation on the training/testing and independent validation datasets (Table 1). The training/testing dataset with 110 entries displayed PCCs of 0.92, 0.60, 0.65, and 0.49 for the SVM, RF, KNN, and ANN algorithms, respectively, while for the 13 entries of the independent validation dataset the MLTs led to PCCs of 0.92, 0.74, 0.69, and 0.50, correspondingly (Table 2). For the DNN, the PCCs on the training/testing and independent validation datasets are 0.53 and 0.53, respectively (Supplementary Table S2).
For SARS-CoV-2, the training/testing dataset shows PCCs of 0.84, 0.50, 0.50, and 0.62 with the SVM, RF, KNN, and ANN algorithms, respectively, while the independent validation dataset resulted in PCCs of 0.92, 0.50, 0.67, and 0.68, correspondingly (Table 2). For the DNN, the training/testing and independent validation datasets show PCCs of 0.70 and 0.51, respectively (Supplementary Table S2).
The overall CoVs dataset includes unique entries from the SARS, MERS, and SARS-CoV-2 datasets. The overall entries were split into training/testing and independent validation datasets with 372 and 42 entries, respectively, via the randomization approach available in the SciKit library (Table 1). The training/testing dataset provides PCCs of 0.73, 0.58, 0.57, and 0.68 during 10-fold cross-validation through SVM, RF, KNN, and ANN, respectively. In comparison, the independent validation dataset provides PCCs of 0.75, 0.49, 0.58, and 0.67, correspondingly (Table 2). For the DNN, the PCCs on the training/testing and independent validation datasets are 0.61 and 0.67, respectively (Supplementary Table S2).

Applicability domain analysis
The applicability domain was assessed by plotting the leverage against the standardized residuals for the best performing SVM models. All the SVM models for SARS, SARS-CoV-2, MERS, and overall CoVs are highly robust, with leverage (h*) values of 1.18, 1.20, 1.39, and 1.43, respectively, as shown in Fig. 1a. The plots of actual versus predicted pIC50 for the SVM models of SARS, SARS-CoV-2, MERS, and overall CoVs also show their robustness, as shown in Fig. 1b.

Validation using the decoy set
For all the developed models, the PCC values were calculated for the random decoy sets by comparing the predicted pIC50 of a decoy and its corresponding parent molecule. The SARS decoy dataset shows PCCs of 0.10, 0.08, and 0.03 on sets 1, 2, and 3, respectively. For SARS-CoV-2, we obtained PCCs of 0.05, 0.01, and 0.05 on the three sets. In the case of MERS, PCCs of 0.11, 0.02, and 0.13 were obtained on sets 1, 2, and 3, respectively. The overall CoVs show PCCs of 0.06, 0.01, and 0.004 on sets 1, 2, and 3, respectively (Fig. 2 and Supplementary Table S9).

Chemical diversity of anti-Coronaviruses molecules
Binning clustering of 221 anti-SARS compounds with a similarity cut-off of 0.60 produced 101 bins. Similarly, binning clustering of 123 anti-MERS compounds with a similarity cut-off of 0.60 produced 53 bins. Furthermore, binning clustering of 142 anti-SARS-CoV-2 compounds with the same cut-off produced 131 bins. Three-dimensional multidimensional scaling showed the diversity of the anti-SARS-CoV-2 compounds in chemical space (Fig. 3a). Hierarchical clustering of the anti-SARS-CoV-2 compounds using the single linkage method produced a hierarchy of compound clusters, presented as circular plots, which shows high chemical diversity among them (Fig. 3b). The 3D multidimensional scaling and hierarchical clustering of the SARS and MERS compounds are shown in Supplementary Fig. S1. The 3D multidimensional scaling shows that all the anti-corona compounds are highly dissimilar in chemical structure. The anti-SARS-CoV-2 compounds show the highest chemical diversity, followed by the anti-MERS and the anti-SARS compounds.

Molecular docking
Molecular docking is highly beneficial for understanding protein-ligand interactions and the bond lengths involved. We selected the top 20 compounds out of the 80 molecules predicted for SARS-CoV-2 based on their high predicted pIC50 values. These compounds were docked sequentially on the SARS-CoV-2 S protein (PDB: 6LZG) to calculate their best binding affinity in kcal/mol. The detailed binding affinities are shown in Supplementary Table S10. Analysis of binding affinity showed that 15 out of 20 compounds have binding energies ranging from −6.8 kcal/mol to −9.5 kcal/mol. These 15 compounds were selected for interaction analysis with the SARS-CoV-2 S-protein (PDB: 6LZG), and their comprehensive list is presented in Table 4. Additionally, six molecules, Verteporfin, Alatrofloxacin, Metergoline, Rescinnamine, Leuprolide, and Telotristat ethyl, with binding energies ranging from −8.0 kcal/mol to −9.5 kcal/mol, and their interacting residues are displayed in Figs. 4 and 5.
Interaction analysis of Verteporfin revealed three interactions with the N-terminal domain (NTD) and one interaction with the C-terminal domain (CTD) of the SARS-CoV-2 S-protein complex with the ACE2 receptor. The interacting residues were SER-77, TRP-203, ASP-206, and GLU-398, which showed conventional hydrogen bonds and carbon-hydrogen bonds as shown in Fig Table 4 represents the interacting residues, interacting domain of the protein, type of interactions, as well as bond length of the six ligands mentioned above.

Status of the predicted repurposed drugs in literature
Apart from performing the cross-validation, internal validation, and applicability domain analysis, we also checked the literature for experimental support of our predicted repurposed drugs. For this, we searched the literature for the drugs predicted by our pipelines, along with their inhibition efficiencies (if provided) reported through in vivo, in vitro, and computational approaches (Supplementary Fig. S5). The details of the top hits predicted by our pipeline, including DrugBank IDs, drug names, primary indications, and testing status, are provided in Table 3.
Thus, this analysis demonstrates the robustness of our prediction algorithm, which further suggests that the predicted drugs will show promising results against the SARS-CoV-2.

Discussion
Currently, the world is facing the crisis of SARS-CoV-2 infection, which has led to millions of deaths. Apart from the present SARS-CoV-2 pandemic, other CoVs like SARS and MERS have also caused epidemics/pandemics in past years [33]. Numerous researchers around the world are focusing on developing drugs against SARS-CoV-2. Drug development is a very complex and time-consuming process. However, in the current scenario of the SARS-CoV-2 pandemic, the need for effective antiviral drugs is critical. In this regard, computational interventions would be an essential step to speed up the research. Researchers have already used different computational approaches to find potential drugs against SARS-CoV-2 infection. To mention a few, Chen TF et al. developed a drug database, DockCoV2, for SARS-CoV-2, which focuses on predicting the binding affinity of FDA-approved and Taiwan National Health Insurance drugs [46]. Another web server, DockThor-VS, developed by Guedes IA et al., provides a virtual screening (VS) platform with curated structures of potential therapeutic targets from SARS-CoV-2, incorporating genetic information relevant to non-synonymous variations [47]. In another study, Li R et al. used network pharmacology-based computational analyses to understand and characterize the binding capacity, biological functions, pharmacological targets, and therapeutic mechanisms of niacin in colorectal cancer (CRC)/COVID-19 [48]. Kumar A et al. used a cheminformatics approach to create different datasets and analyzed scaffold diversity to predict SARS-CoV-2 inhibitors [49]. Recently, Beck B et al. used a pre-trained deep learning-based drug-target interaction model called molecule transformer-drug target interaction (MT-DTI) to identify commercially available drugs that could act on SARS-CoV-2 proteins [50]. Further, Zhou Y et al. published an integrative network-based systems pharmacology methodology for rapid identification of repurposable drugs and drug combinations for the potential treatment of 2019-nCoV/SARS-CoV-2 [51]. The inhibitors were mainly designed against the main protease (Mpro) of SARS-CoV-2 using in-silico molecular docking approaches. However, machine learning based approaches are less explored for predicting drugs against SARS-CoV-2 infection.
MLT-based methods using experimentally validated chemicals/drugs with anti-CoV activity are lacking. The current study focuses on predicting efficient and novel repurposed drug candidates for CoVs, namely SARS-CoV-2, MERS, and SARS. We extracted the experimentally validated drugs/compounds tested for antiviral activity against CoVs from the 'DrugRepV' database. To develop the prediction algorithm, we explored 17,968 chemical and structural descriptors (1D, 2D, and 3D) as well as fingerprints. For the prediction algorithm, we used robust methods like feature selection, internal and external validation, MLTs, and applicability domains. Among all MLTs used in developing the predictive models, the SVM outperformed the RF, KNN, ANN, and DNN. The PCC of the SVM models for SARS, SARS-CoV-2, MERS, and overall CoVs ranges from 0.73 to 0.92 on the training/testing datasets, and the independent validation datasets performed equally well.
Further, the robustness of the models was cross-checked by plotting the applicability domain and the actual vs. predicted pIC50 values. Williams plots were used to assess the applicability domain of the predictive models and support the robustness of all the models. Likewise, the analysis of the actual vs. predicted plots also validated the robustness of our models. We also checked the robustness of the models using external validation datasets and decoy sets. Using the external validation datasets, we achieved PCCs ranging from 0.60 to 0.90. In comparison, the decoy datasets have PCCs from 0.004 to 0.09. In earlier studies also, the decoy sets had low efficiency compared to the corresponding developed models, demonstrating the robustness of the computational models [30,41]. Chemical clustering is often used to understand the distribution of compounds in chemical space. The binning clustering method aggregates chemical compounds according to a user-defined similarity cutoff. Here a Tanimoto coefficient (Tc) (the number of features shared between two compounds divided by the number of features in their union) of 0.60 was used. The Tc ranges from 0 to 1, where a higher value indicates greater similarity of the compounds under investigation. So, using a Tc of 0.60 joined the compounds with similarity values of 0.60 or higher together into clusters. As there are many clusters present per 'anti-corona' compound group, the compounds are well dispersed in chemical space. The multidimensional scaling (MDS) uses the classical multidimensional scaling 'cmdscale' function implemented in R and takes a matrix of 'item to item' distances as input. Each item is assigned a coordinate, and the 'item to item' distances are then displayed in 2D and 3D scatter plots. The MDS plots generated in the analysis showed that each group of 'anti-SARS', 'anti-MERS', and 'anti-SARS-CoV-2' compounds, as well as the overall 'anti-corona' compounds, are well dispersed in 2D and 3D chemical space.
On the other hand, the hierarchical clustering uses the 'hclust' function of R and requires a distance matrix input of 'all-against-all' compound distances. The 'all-against-all' distance matrix is generated by subtracting the Tc similarity measure from one (1-Tc). Both the hierarchical clustering circular plots generated in the analysis show that the anti-corona compounds are highly dissimilar in their structural features.
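The Tc, the binning step, and the (1-Tc) distance matrix described above can be illustrated with a small pure-Python sketch. It is not the ChemMine Tools or R implementation used in the study: fingerprints are represented as sets of "on" bit indices, binning is assumed to behave like single-linkage grouping (a compound joins a bin if it reaches the cutoff with at least one member), and the compound names and bit sets are made up for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tc = number of shared 'on' bits divided by bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def binning_clusters(fps: Dict[str, FrozenSet[int]], cutoff: float = 0.6) -> List[List[str]]:
    """Group compounds linked by chains of pairwise Tc >= cutoff (assumed single linkage)."""
    parent = {name: name for name in fps}

    def find(name: str) -> str:
        # Follow parent links to the bin representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(fps, 2):
        if tanimoto(fps[a], fps[b]) >= cutoff:
            parent[find(a)] = find(b)

    bins: Dict[str, List[str]] = {}
    for name in fps:
        bins.setdefault(find(name), []).append(name)
    # Largest bins first, ties broken by member names for a stable order.
    return sorted(bins.values(), key=lambda members: (-len(members), members))


def distance_matrix(fps: Dict[str, FrozenSet[int]]) -> List[List[float]]:
    """All-against-all distances computed as 1 - Tc."""
    names = list(fps)
    return [[1.0 - tanimoto(fps[a], fps[b]) for b in names] for a in names]


# Hypothetical fingerprints (sets of 'on' bit positions), for illustration only.
example_fps = {
    "A": frozenset({1, 2, 3, 4, 6}),
    "B": frozenset({1, 2, 3, 4, 5}),
    "C": frozenset({1, 2, 3, 4, 5, 6}),
    "D": frozenset({10, 11, 12}),
    "E": frozenset({10, 11, 20, 21}),
}
print(binning_clusters(example_fps, cutoff=0.6))
```

A dataset that yields many small bins at a 0.60 cutoff, as reported for the anti-SARS-CoV-2 compounds, corresponds to few pairs exceeding the threshold in such a matrix.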
Since drug development is a very complex and time-consuming process, several research groups have been trying, from the start of the SARS-CoV-2 pandemic, to identify efficient repurposed drug candidates via computational, in vitro, and in vivo studies. Our computational predictive models were therefore used to identify repurposed drug candidates from the 'approved' drug category of the DrugBank database. Further, we checked which of the repurposed drug candidates predicted by our pipeline have already been validated in the literature. Interestingly, we found that a few top hits from our study have been validated, which further confirms the robustness of our predictive pipeline. Among the top 10 drug candidates for SARS-CoV-2 with the lowest predicted IC50, Verteporfin, which has primarily been used to treat age-related macular degeneration [35] and various types of cancers like prostatic cancer, breast cancer, etc. [36], has already been validated as a potential ACE2 inhibitor in vitro and in a mouse model [34]. Guanfacine, which is primarily used to treat Attention Deficit Hyperactivity Disorder (ADHD), is already in use to treat delirium in COVID-19 patients [37]. Likewise, Trovafloxacin, a broad-spectrum antibiotic, has been predicted to be an efficient main protease (Mpro) inhibitor in a docking study by Gimeno A et al. [38]. Argatroban, which was earlier used as a thrombin inhibitor, also shows promising inhibition against SARS-CoV-2 [39]. Reboxetine, which was initially used to treat clinical depression, shows promising results in an in vitro study, with a ΔG of binding of −8.86 kcal/mol, and inhibits Mpro [40]. Therefore, the repurposed drug candidates predicted by our pipeline could help speed up research in the field of CoV inhibitors.
Molecular docking and molecular dynamics methods are used as a well-reasoned strategy that provides valuable insights into the physicochemical properties of molecules of interest. They also provide information about the interaction and reactivity of the molecules as potential drug candidates [42]. A few literature reports have identified repurposed drugs that target the SARS-CoV-2 Spike protein [43,44,45]. The current study identifies six ligand molecules with high binding affinity, i.e., Verteporfin, Alatrofloxacin, Metergoline, Rescinnamine, Leuprolide, and Telotristat ethyl, against the SARS-CoV-2 S-protein complex with the ACE2 receptor. We found binding affinities of −8.8 kcal/mol and −8.5 kcal/mol for Metergoline and Rescinnamine, respectively. These findings correspond with the previous study of Chen T-F. et al., which showed docking scores of −8.4 and −7.5 for Metergoline and Rescinnamine, respectively, against the SARS-CoV-2 Spike-RBD [46]. Therefore, the present work can contribute to identifying efficacious repurposed drugs against SARS-CoV-2 through computational approaches.
Leveraging this, we have developed an AI and MLT based predictor named 'anticorona', which includes modules of high-performance predictive models for CoVs including SARS-CoV-2, SARS, and MERS. We have also ensured the robustness of the predictive models using i) external independent validation datasets, ii) decoy datasets, iii) applicability domain, and iv) chemical analyses. The developed models were used to predict promising repurposed drug candidates against CoVs after scanning the DrugBank. Top predicted molecules for SARS-CoV-2 were further validated by molecular docking against the spike protein complex with the ACE2 receptor. We found potential repurposed drugs, namely Verteporfin, Alatrofloxacin, Metergoline, Rescinnamine, Leuprolide, and Telotristat ethyl, with high binding affinity. Furthermore, some of the drugs predicted for SARS-CoV-2, like Argatroban, Metformin, Amlodipine, Simvastatin, Isavuconazonium, and Diosmin, have already entered clinical trials as interventional drugs. Likewise, some drugs were also predicted through computational approaches by other groups. These findings confirm the predictive power of our computational models. We anticipate these computational methods would assist in antiviral drug discovery against SARS-CoV-2 and other CoVs. In the current scenario of the SARS-CoV-2 pandemic, researchers can directly use the predicted repurposed drug candidates, which would save money and time in developing promising therapeutic candidates.

Datasets
The dataset of CoV inhibitors used in the study was extracted from our recently published DrugRepV database [32], along with information on inhibition efficiency and chemical structure (SMILES). We used three important CoVs, namely SARS, SARS-CoV-2, and MERS, in the analysis. Further, we predicted the repurposed drug candidates using MLTs for four categories of viruses, i.e., overall CoVs as well as individual SARS-CoV-2, SARS, and MERS. The datasets used in the analysis are available as Supplementary Tables S5-S8. The overall methodology is described in Fig. 6. The following steps have been used:
1. The SARS, SARS-CoV-2, MERS, and overall CoVs have 380, 342, 401, and 1123 inhibitor entries, respectively.
2. Quality control involves filtering for entries with IC50/EC50 values and SMILES, and retaining unique entries per category.
3. The IC50/EC50 values were converted into the negative logarithm of the half-maximal inhibitory concentration (pIC50) using the formula pIC50 = −log10(IC50), where the IC50 is in molar (M) concentration.
4. After quality control, we obtained 221, 142, 123, and 414 unique entries for SARS, SARS-CoV-2, MERS, and overall CoVs, respectively.
5. Each dataset was divided into training/testing (T) and independent validation (V) datasets using a randomization approach. This resulted in 221 (T200 + V21), 142 (T128 + V14), 123 (T111 + V12), and 414 (T373 + V41) entries for SARS, SARS-CoV-2, MERS, and overall CoVs, correspondingly.
6. The 1D, 2D, and 3D molecular descriptors and fingerprints were calculated using the PaDEL software.
7. Feature selection algorithms were applied to obtain the most relevant features in all four categories.
8. The prediction models were developed using various MLTs like SVM, RF, ANN, KNN, and DNN.
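The pIC50 conversion in the steps above can be sketched as a small helper. This is a minimal illustration, assuming the source values may be recorded in different concentration units and must first be rescaled to molar; the function name `pic50_from_ic50` and the unit table are hypothetical and not part of the original pipeline.

```python
import math

# Conversion factors to molar; the unit labels are illustrative assumptions.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(value: float, unit: str = "uM") -> float:
    """Return pIC50 = -log10(IC50 expressed in molar)."""
    if value <= 0:
        raise ValueError("IC50 must be a positive concentration")
    molar = value * UNIT_TO_MOLAR[unit]
    return -math.log10(molar)


# Example concentrations chosen only to show the scale of the transformation.
for ic50, unit in [(1.0, "uM"), (50.0, "nM"), (0.25, "mM")]:
    print(f"IC50 = {ic50} {unit} -> pIC50 = {pic50_from_ic50(ic50, unit):.2f}")
```

A lower IC50 gives a higher pIC50, so ranking compounds by predicted pIC50 places the most potent candidates first.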

Descriptors extraction
To develop the CoV-specific prediction models from the anti-corona compounds, we used the PaDEL-Descriptor software [52]. We calculated 1D, 2D, and 3D molecular descriptors and fingerprints, totaling 17,968 features. Molecular descriptors are pieces of information encoded in the molecular structure of a chemical. They are classified according to their dimensionality, viz., 1D, 2D, and 3D. The 1D descriptors present very basic information calculated from the molecular formula, like molecular weight. The 2D descriptors, like the number of bonds, connectivity indices, etc., describe the signatures calculated from two-dimensional molecular representations. The 3D descriptors, as the name suggests, describe the molecular properties related to three-dimensional conformations of the molecule, such as solvent accessible surface areas, intramolecular hydrogen bonding, etc. Fingerprints are another way of representing molecules as mathematical objects, where binary digits (bits) are used to find and/or differentiate molecular substructures. Together, these descriptors and fingerprints are necessary for establishing a quantitative structure-activity relationship (QSAR) of the chemical compounds under study [53]. Such descriptors have been used previously in various studies for predicting inhibitors against various infectious agents [30,41,54].

Format conversion
We converted the anti-corona chemical compound structures from the simplified molecular-input line-entry system (SMILES) format to the three-dimensional structure-data file (3D-SDF) format using the open-source chemical toolbox Open Babel version 3.0.0 [55]. This format conversion step is necessary for calculating the different descriptors and fingerprints for the curated anti-corona chemical compound datasets.
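A conversion of this kind can be scripted around the Open Babel command-line tool. The sketch below assumes the `obabel` executable is installed and on the PATH; the file names are hypothetical, and the exact options used in the study are not stated, so `--gen3d` (generate 3D coordinates) is an assumption about how the 3D-SDF files could be produced.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def obabel_command(smiles_file: str, sdf_file: str) -> list[str]:
    """Build an Open Babel call that reads SMILES and writes SDF with 3D coordinates."""
    return ["obabel", "-ismi", smiles_file, "-osdf", "-O", sdf_file, "--gen3d"]


def convert(smiles_file: str, sdf_file: str) -> bool:
    """Run the conversion if obabel is available; return True when the SDF file was written."""
    if shutil.which("obabel") is None:
        return False  # Open Babel not installed; nothing is run
    subprocess.run(obabel_command(smiles_file, sdf_file), check=True)
    return Path(sdf_file).exists()


# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(obabel_command("anti_corona.smi", "anti_corona_3d.sdf")))
```

Building the argument list separately from running it keeps the command easy to inspect or log before any files are touched.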

Machine learning algorithms
For the development of the prediction algorithm, we used five different MLTs, i.e., SVM, RF, KNN, ANN, and DNN. SVM, RF, KNN, and ANN were called using the SciKit library of Python, while the DNN was run through the Keras Deep Learning Library.

Support Vector Machine
SVM is a supervised MLT used for solving classification and regression problems [56]. In the current study, we used SVM for solving the regression problem, i.e., Support Vector Regression (SVR). The SVR works on the same principle as SVM classification, with minor differences. In general, it fits a hyperplane that maximizes the margin while minimizing the error, such that some proportion of the error is tolerated. We used linear and non-linear SVR with kernels like the Gaussian Radial Basis Function and Polynomial.
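As an illustration of SVR evaluated by 10-fold cross-validation, the following sketch uses scikit-learn on synthetic data. The descriptor matrix, the target values, and the hyperparameters (`C`, `epsilon`, `gamma`) are placeholders chosen for the example and do not reflect the tuned settings or data of the study; the scaling step is also an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a descriptor matrix and pIC50 values (not the study's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = 5.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=120)

# RBF-kernel SVR; C, epsilon and gamma are illustrative values, not tuned settings.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale"))

# 10-fold cross-validation: each compound is predicted by a model that never saw it.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

pcc = np.corrcoef(y, y_pred)[0, 1]
print(f"10-fold CV PCC: {pcc:.2f}")
```

Swapping `kernel="rbf"` for `"linear"` or `"poly"` gives the other SVR variants mentioned above without changing the evaluation loop.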

Random Forest
RF is a supervised learning algorithm that uses an ensemble technique for classification and regression tasks [57]. It works by building a forest of multiple decision trees from the training dataset; for a regression task, the prediction output is the mean of the predictions from the individual trees. To obtain optimal output from the RF, we tuned attributes like the number of trees (estimators), maximum depth of the trees (max_depth), minimum number of samples required to split an internal node (min_samples_split), minimum number of samples required to be at a leaf node (min_samples_leaf), etc.
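The attributes listed above map directly onto scikit-learn's `RandomForestRegressor`. The sketch below runs a small grid search on synthetic data; the grid values, the 5-fold setting, and the scoring choice are assumptions made to keep the example fast and are not the search space used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the study's descriptors or activities).
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = 5.0 + X[:, 0] ** 2 + rng.normal(scale=0.1, size=80)

# Illustrative grid over the attributes named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best = search.best_estimator_

# For regression, the forest output is the mean of the individual tree predictions.
per_tree = np.stack([tree.predict(X[:3]) for tree in best.estimators_])
matches_mean = np.allclose(per_tree.mean(axis=0), best.predict(X[:3]))
print(search.best_params_, matches_mean)
```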

k-Nearest Neighbor
KNN is a non-parametric MLT that works for both classification and regression problems [58]. It is an instance-based or lazy learning method, which depends on the contribution of the local data. For a given input, the prediction is derived from its k closest neighbors in the feature space. For the KNN algorithm, we used different numbers of nearest neighbors, i.e., 3, 5, 7, 9, 11, etc.
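The idea can be shown with a few lines of plain Python. This is a didactic sketch, assuming Euclidean distance and an unweighted mean of the neighbors' target values; it is not the SciKit implementation used in the study, and the toy data are invented.

```python
import math
from typing import Sequence


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[float],
                query: Sequence[float], k: int = 3) -> float:
    """Predict a value as the mean target of the k nearest training points."""
    # Pair every training point's distance to the query with its target, nearest first.
    by_distance = sorted(
        (math.dist(x, query), target) for x, target in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy one-descriptor dataset with made-up activity values.
toy_X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
toy_y = [4.0, 5.0, 6.0, 7.0, 9.0]

for k in (1, 3, 5):
    print(k, knn_predict(toy_X, toy_y, [1.1], k=k))
```

Small k follows the local data closely, while larger k averages over more distant compounds, which is why several values of k are typically compared.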

Artificial Neural network
ANN is a supervised algorithm consisting of a collection of connected units or nodes, known as artificial neurons, analogous to the neurons of animal brains [59]. It is an information processing technique that includes a network of interconnected processing units, which work together to process information and give a meaningful output. To obtain the optimized result, we used different activations (e.g., tanh, relu), solvers (e.g., sgd, adam), and learning rates (e.g., constant, invscaling, adaptive, etc.).

Deep Neural network
DNN is a type of ANN with multiple layers between the input and output layers. It is a feedforward network, where the data moves from the input towards the output layers via the intermediate layers without moving in the backward direction [60]. It can be used to model linear as well as complex non-linear relationships. The extra layers enable the composition of features from lower layers for modeling very complex data. We used the Keras API of the TensorFlow package for solving our regression-based problem. We used a combination of different optimizers (Adam, RMSprop, SGD, Adamax, etc.) and activations (tanh, sigmoid, softmax, etc.) to get the best result. We used six intermediate layers with different numbers of neurons in each layer, i.e., 256, 128, 64, 32, 16, and 8.

Fig. 6. The overall methodology used in the study. The inhibitors of the Coronaviruses (SARS, SARS-CoV-2, and MERS) were extracted from the literature. The dataset was split into training/testing and independent validation sets using a randomization approach. The descriptors were calculated using the PaDEL software, followed by the selection of relevant features. The prediction models were developed using machine learning algorithms like Support Vector Machine, Random Forest, k-Nearest Neighbor, Artificial Neural Network, and Deep Neural Network.

Feature selection
Using all 17,968 extracted features in the development of machine learning models would lead to various problems like overfitting, the curse of dimensionality, etc. Feature selection is therefore an important step. We used the Recursive Feature Elimination (RFE) module of the SciKit library in Python. The RFE extracts the features from the training dataset that are most relevant for predicting the target variable [61,62]. In general, it uses two important attributes, i.e., the choice of algorithm and the number of features to be selected. In the current study, we used the SVR method as the algorithm within the RFE module.
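A minimal sketch of RFE wrapped around an SVR is given below, using scikit-learn on synthetic data. RFE needs an estimator that exposes feature weights, so a linear-kernel SVR is assumed here; the number of features to keep (5) and the elimination step (1) are illustrative values rather than the settings of the study.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic descriptors: only columns 3 and 7 carry signal (illustrative data).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = 5.0 + 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.1, size=100)

# A linear-kernel SVR exposes coef_, which RFE uses to rank and drop features.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)
X_reduced = selector.transform(X)
print("Selected feature indices:", selected.tolist())
print("Reduced matrix shape:", X_reduced.shape)
```

The two attributes mentioned in the text correspond to the `estimator` and `n_features_to_select` arguments.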

Performance measures
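For regression (quantitative) mode, the correlation between two variables is measured using Pearson's correlation coefficient (PCC or R). Here, the two variables are the actual and predicted values. The PCC ranges from −1 to +1. A PCC of −1 indicates that the predicted and actual values are negatively correlated, 0 shows random prediction, while +1 indicates a positive correlation between them. PCC is calculated using the formula:
$$R = \frac{n\sum_{i=1}^{n} E_i^{\mathrm{act}} E_i^{\mathrm{pred}} - \sum_{i=1}^{n} E_i^{\mathrm{act}} \sum_{i=1}^{n} E_i^{\mathrm{pred}}}{\sqrt{\left[n\sum_{i=1}^{n} \left(E_i^{\mathrm{act}}\right)^2 - \left(\sum_{i=1}^{n} E_i^{\mathrm{act}}\right)^2\right]\left[n\sum_{i=1}^{n} \left(E_i^{\mathrm{pred}}\right)^2 - \left(\sum_{i=1}^{n} E_i^{\mathrm{pred}}\right)^2\right]}}$$
where n, E_i^pred, and E_i^act are the size of the test set and the predicted and actual efficiencies of CoV inhibition, respectively.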
The coefficient of determination (R²) is a statistical measure of how well a regression line approximates the real data. R² varies from 0 to 1; a value near 1 means the regression fits the data almost perfectly, whereas a value towards 0 indicates a poor fit.
Mean Absolute Error (MAE) is the average of the absolute differences between the actual and predicted values. Root Mean Square Error (RMSE) is the square root of the average of the squared differences between the actual and predicted values.
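The performance measures described above can be computed as in the following pure-Python sketch. The PCC follows the standard Pearson formula; R² is computed here as 1 − SS_res/SS_tot, which is one common definition and an assumption on our part, since the exact expression used in the study is not given. The example values are invented.

```python
import math
from typing import Sequence


def pcc(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson's correlation coefficient between actual and predicted values."""
    n = len(actual)
    sum_a, sum_p = sum(actual), sum(predicted)
    sum_ap = sum(a * p for a, p in zip(actual, predicted))
    sum_aa = sum(a * a for a in actual)
    sum_pp = sum(p * p for p in predicted)
    numerator = n * sum_ap - sum_a * sum_p
    denominator = math.sqrt((n * sum_aa - sum_a ** 2) * (n * sum_pp - sum_p ** 2))
    return numerator / denominator


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up pIC50 values for four compounds.
actual_values = [5.0, 6.0, 7.0, 8.0]
predicted_values = [5.5, 5.5, 7.5, 7.5]
print(f"PCC={pcc(actual_values, predicted_values):.3f}",
      f"R2={r2(actual_values, predicted_values):.3f}",
      f"MAE={mae(actual_values, predicted_values):.3f}",
      f"RMSE={rmse(actual_values, predicted_values):.3f}")
```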

Applicability domain
The robustness of the developed predictive models was cross-checked by examining their applicability domains [29,30]. We used the Williams plot for checking the applicability domain. The Williams plot shows the leverage against the standardized residuals for the training/testing and independent validation datasets. Further, the robustness was also checked by plotting the actual values against the predicted values. The applicability domain was checked for both the training/testing and independent validation datasets. A predictive model is considered robust if the points of the actual and predicted values lie close to the trend line.
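The quantities behind a Williams plot can be computed as sketched below with NumPy. The source does not state the formulas it used, so several conventions are assumed here: leverage is taken from the diagonal of the hat matrix built on the training descriptors, the warning leverage is h* = 3(p + 1)/n, residuals are standardized by their sample standard deviation, and ±3 is used as the residual bound. The data are synthetic.

```python
import numpy as np


def leverages(X_train: np.ndarray, X_query: np.ndarray) -> np.ndarray:
    """Leverage h_i = x_i^T (X^T X)^-1 x_i, with X taken from the training set."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def standardized_residuals(actual, predicted) -> np.ndarray:
    """Residuals divided by their sample standard deviation (a simple convention)."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return residuals / residuals.std(ddof=1)


def warning_leverage(n_samples: int, n_features: int) -> float:
    """Commonly used threshold h* = 3(p + 1)/n (assumed, not stated in the source)."""
    return 3.0 * (n_features + 1) / n_samples


# Synthetic descriptors and activities, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y_actual = rng.normal(loc=6.0, size=50)
y_predicted = y_actual + rng.normal(scale=0.3, size=50)

h = leverages(X, X)
z = standardized_residuals(y_actual, y_predicted)
h_star = warning_leverage(*X.shape)
inside = (h <= h_star) & (np.abs(z) <= 3.0)
print(f"h* = {h_star:.2f}; {int(inside.sum())} of {len(h)} compounds inside the domain")
```

Plotting `h` on the x-axis against `z` on the y-axis, with lines at `h_star` and ±3, gives the usual Williams plot layout.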

Decoy dataset
Decoy sets were generated for four categories, i.e., overall CoVs and individual SARS-CoV-2, SARS, and MERS, using the RADER (RApid DEcoy Retriever) software [63]. We used the default parameters of the tool, i.e., Tanimoto thresholds of 0.75 for active ligand vs. decoy and 0.50 for decoy vs. decoy. For decoy selection, the ZINC database (17,900,742 entries) was selected. Decoys were randomly selected for all the categories using a random number generator program. Using this program, we developed three random sets for each category of virus. For example, for SARS-CoV-2, each set contains 142 randomly selected decoys. Similarly, random sets were developed for SARS (221), MERS (123), and overall CoVs (414).
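The random selection of three decoy sets per category can be sketched with Python's standard random number generator. The random number generator program used in the study is not described, so the seeded `random.Random` instance, the function name, and the placeholder decoy identifiers below are all assumptions for illustration.

```python
import random
from typing import List, Sequence


def random_decoy_sets(decoy_pool: Sequence[str], set_size: int,
                      n_sets: int = 3, seed: int = 0) -> List[List[str]]:
    """Draw n_sets random samples of set_size decoys (no repeats within a set)."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    pool = list(decoy_pool)
    return [rng.sample(pool, set_size) for _ in range(n_sets)]


# Placeholder identifiers standing in for decoys retrieved by RADER.
pool = [f"decoy_{i:05d}" for i in range(1000)]

# Example: three sets of 142 decoys, matching the SARS-CoV-2 set size in the text.
sets = random_decoy_sets(pool, set_size=142)
print([len(s) for s in sets])
```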

Chemical analysis
Chemical clustering of the SARS, MERS, SARS-CoV-2, and overall unique compounds was done using the ChemMine Tools [64]. We performed the binning clustering using the Tanimoto coefficient (similarity cutoff 0.6). MDS was done at 2D and 3D level using the same similarity threshold. Hierarchical clustering was performed for all the molecules where the heatmaps and circular plots of the heatmaps were constructed for each aforementioned compound group using the 'distance matrix' parameter and a 'single' linkage method.

Drug repurposing
Repurposing of drugs against the SARS-CoV-2, SARS, and MERS coronaviruses was done using our developed predictive models. We predicted the repurposed drugs using the best performing SVM models in all three categories. For this purpose, the 'Approved' category of drugs was downloaded from the DrugBank repository [65]. The descriptors and fingerprints of all the 2468 approved drugs were calculated using the PaDEL software. Further, the descriptors of the approved drugs were used to predict the highly efficient drugs against all three categories of viruses.
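Once a model has produced a predicted pIC50 for every approved drug, selecting the top candidates is a ranking step. The sketch below assumes the predictions are available as a name-to-pIC50 mapping; the function name and the drug labels and values are placeholders, not results from the study.

```python
from typing import Dict, List, Tuple


def top_candidates(predicted_pic50: Dict[str, float], n: int = 10) -> List[Tuple[str, float, float]]:
    """Return (drug, predicted pIC50, implied IC50 in micromolar) for the n best drugs."""
    ranked = sorted(predicted_pic50.items(), key=lambda item: item[1], reverse=True)[:n]
    # IC50 [M] = 10^(-pIC50), so IC50 [uM] = 10^(6 - pIC50).
    return [(name, value, 10 ** (6 - value)) for name, value in ranked]


# Placeholder predictions for three hypothetical drugs.
predictions = {"drug_A": 5.0, "drug_B": 7.0, "drug_C": 6.0}
for name, pic50, ic50_um in top_candidates(predictions, n=2):
    print(f"{name}: pIC50 = {pic50:.2f}, IC50 = {ic50_um:.3g} uM")
```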