In Silico Prediction Tool for Drug-Likeness of Compounds Based on Ligand-Based Screening

Drug-likeness prediction is a time-consuming and tedious process. With in-vitro methods alone, drug development takes a long time to bring a compound to market, and the high failure rate is a further concern. Many in-silico methods are currently available, and more are being developed, to support the drug discovery and development process. Several online tools predict and classify a drug after analyzing the drug-likeness properties of compounds, but each tool has its own advantages and disadvantages. In this study, a tool is developed to predict the drug-likeness of compounds given as input. This may help chemists analyze a compound before actually synthesizing it for the drug discovery process. The tool includes both descriptor-based and fingerprint-based calculations for the given compounds. The descriptor calculation also applies a set of rules and filters such as Lipinski's rule, the Ghose filter, the Veber filter and BBB likeness. Previous studies have shown that fingerprint-based prediction is more accurate than descriptor-based prediction, so the drug-likeness prediction tool in the current study incorporates both molecular descriptors and fingerprint-based calculations using five different fingerprint types. The study evaluated five different machine learning algorithms for drug-likeness prediction and selected the one with the highest accuracy. When a chemist inputs a compound in SMILES format, the tool predicts whether the given candidate compound is a drug or a non-drug.

Keywords: Drug-likeness prediction, ligand-based virtual screening, QSAR, molecular fingerprints, machine learning, prediction accuracy

INTRODUCTION
Nowadays we can find different types of software and tools on the web for predicting drug-likeness. The problem with these tools is that they are typically based on a single method, even though other methods are available for drug-likeness prediction. Generally, they focus only on descriptor-based calculation, fingerprint-based calculation, or docking. In our further studies, we found the importance of a tool that incorporates all these methodologies: with such a tool, the accuracy of drug-likeness prediction is much higher, and we can derive a far better conclusion for the chemist who is waiting for the results. DruLiTo, SwissADME and DrugMint are well-known drug-likeness prediction tools available on the web, and each has its own problems. DruLiTo is an open-source virtual screening tool for drug-likeness prediction; its problem is that it freezes when processing some compounds, and it only contains the rules and filters for descriptor-based prediction. SwissADME has the same limitation of focusing only on descriptor-based calculation. Various studies and papers prove the accuracy of fingerprint-based prediction, so for more accurate results we need to combine these two methodologies. DrugMint is a web server that integrates both methodologies, but it supports only a limited number of fingerprint types. No existing tool incorporates all these methods effectively, which increases the importance of our Drug-Likeness Prediction Tool. The tool incorporates different filters and rules for descriptor prediction, because no single filter or rule by itself can make a reliable prediction.
For example, when we use Lipinski's rule alone, the drug-likeness prediction is not very accurate: using only this rule, we obtained 50%-80% wrong information in the output. By combining these filters and rules, we can reduce the wrong information and derive much better results. For the fingerprint-based prediction, we use five different types of fingerprints with five different algorithms, so the tool can figure out which algorithm is most accurate for each fingerprint and present the best-predicted output to the chemist.

QSAR
QSAR is a technique that tries to predict the activity, reactivity and properties of an unknown set of molecules, such as the binding affinity of hypothetical molecules or the toxic potential of existing ones, with the help of effective molecular descriptors. The idea behind QSAR modeling is that candidate molecules that share the same molecular features will trigger the same biological responses.
This area of research tries to establish a relation between the structural and electronic characteristics of candidate molecules. We are familiar with the ADMET properties, which are mandatory features of a drug compound: absorption, distribution, metabolism, excretion and toxicity. Nowadays, through QSAR modeling, we can also predict whether a candidate molecule is a drug or a non-drug (Lo et al., 2018). The initial phases of the QSAR technique focused on individual features of a molecule, from which the biological response of that molecule can be estimated. This initial approach is called 1D-QSAR. Researchers in this field extended their studies by combining more than one molecular property, which led to the 2D-QSAR technique (Kwon et al., 2019).
A more advanced version of QSAR, known as the 3D-QSAR technique, maps a compound's interactions with biological and chemical features into a 3-D vector space. The main drawback of 3D-QSAR is that we are unable to anticipate the exact location of the corresponding molecule: because the molecule is plotted into a 3-D vector space that includes other replacement molecules, it is very hard to locate the target molecule without structural information. By using alignment descriptors we can solve this problem (Eckert and Bajorath, 2007). The 4D-QSAR technique, an extension of 3D-QSAR, was developed to address this alignment issue. It solves the alignment problem by representing each molecule in non-identical conformations, orientations and protonation states; the underlying QSAR algorithm then locates the particular molecule among the molecules represented in different positions. Compared to other QSAR models, 4D-QSAR is more practicable with different binding targets, and it also solves the alignment issue (Lill, 2007; Myint and Xie, 2010).

Descriptors
We discussed how candidate molecules are screened in the drug discovery process. Now we discuss the descriptors that identify or categorize a given compound as drug or non-drug. QSAR (quantitative structure-activity relationship) models have come to play a major role in drug discovery because of the cost of methods like high-throughput screening. These models require good molecular descriptors that provide information about the molecular features of the target candidate molecule. Molecular descriptors differ in the underlying algorithm used for their calculation and in the type of molecular representation. Geometrical descriptors, topological indices, and physicochemical and constitutional descriptors are some of the well-known types of descriptors.

Topological Descriptors
These are 2D descriptors mainly concerned with the internal connectivity of molecular compounds. They are considered structure-explicit descriptors because they are derived from the topological representation of compounds. In numerical form, they encode features like shape, presence of heteroatoms, multiple bonds, size and branching, and they generally represent the connectivity of atoms or bonds. Because they capture these features and properties, they play a major role in describing the biological activity, pharmacokinetic properties and physicochemical properties of the respective compounds. Numerical graph calculations are necessary for computing topological descriptors, because graphs encode a non-numeric form of the compound's molecular substructure. Commonly used topological descriptors are the connectivity indices, Balaban J index, Zagreb indices, Wiener index and Kier shape indices. These descriptors categorize compound molecules based on shape, size, branching and flexibility.
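As a concrete illustration, the Wiener index mentioned above is the sum of shortest-path bond distances over all atom pairs in the hydrogen-suppressed molecular graph. A minimal sketch (the example graphs and function name are illustrative, not part of the tool):

```python
from collections import deque

def wiener_index(adjacency):
    """Wiener index: sum of shortest-path bond distances over all
    unordered atom pairs in a hydrogen-suppressed molecular graph."""
    n = len(adjacency)
    total = 0
    for start in range(n):
        # Breadth-first search from each atom gives bond-count distances.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # each unordered pair was counted twice

# n-butane as a 4-atom path graph: C0-C1-C2-C3
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))  # 10
```

Note how the index discriminates branching: the more compact isobutane (a star graph on four atoms) scores 9, lower than the linear n-butane's 10.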

Geometrical Descriptors
Geometric descriptors are calculated from the 3D coordinates of the atoms in a specific compound molecule. Compared with topological descriptors, they provide more information and better discrimination when comparing compounds that share the same topology. Geometry optimization adds computational overhead for these descriptors, but it can reveal new information about flexible molecules with several molecular conformations; it also increases complexity. For this specific reason, these descriptors need alignment rules for comparing candidate molecules.

Physicochemical Descriptors
From the 2D structure, we can analyze different physical and chemical properties. These properties influence drug activity in the body, and favorable properties can increase market demand. So, by evaluating the chemical and physical properties, we can assist the drug discovery process in identifying which compound to select. We therefore need to pay attention to physicochemical properties like solubility and permeability, which determine the optimal potency (Khan, 2016; Brüstle et al., 2002).

Descriptor Rules
Several descriptor prediction rules are available. For our Drug-Likeness Prediction Tool, we took six descriptor rules. The conditions a compound must satisfy for Lipinski's rule and the Ghose filter are shown in Tables 1 and 2, respectively. Tables 3 and 4 show the requirements for a compound to satisfy the CMC-50 rule and the Veber rule, respectively. The conditions for the other two descriptor rules, MDDR and BBB likeness, are given in Tables 5 and 6.
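As an illustration of how such a rule-based filter can be evaluated, here is a minimal sketch of Lipinski's rule of five on precomputed descriptor values. The function name and the at-most-one-violation convention are assumptions for this sketch (some implementations require zero violations); the tool's exact thresholds are those given in Tables 1-6:

```python
def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    """Lipinski's rule of five: a compound is flagged drug-like
    when it violates at most one of the four conditions."""
    violations = sum([
        mol_weight > 500,   # molecular weight <= 500 Da
        logp > 5,           # octanol-water partition coefficient <= 5
        h_donors > 5,       # hydrogen-bond donors <= 5
        h_acceptors > 10,   # hydrogen-bond acceptors <= 10
    ])
    return violations <= 1

# Approximate descriptor values for an aspirin-like small molecule
print(passes_lipinski(180.2, 1.2, 1, 4))   # True
```

The other descriptor rules (Ghose, Veber, CMC-50, MDDR, BBB likeness) follow the same pattern with different descriptors and thresholds.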

Fingerprints
We have discussed the descriptors and their use in the descriptor-based calculation. Now we take an in-depth look at the fingerprint prediction part of our tool.
The most common problem when trying to measure the similarity between two molecules is the complexity of their molecular representation. To measure similarity in a computationally straightforward manner, we have to reduce this complexity. The most commonly used simple representation is the molecular fingerprint, which converts a molecule into a sequence of bits; with this representation, we can easily compare the similarity of two molecules.
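Given two such bit strings, similarity is commonly scored with the Tanimoto coefficient: the number of on-bits shared by both fingerprints divided by the number of on-bits set in either. A minimal sketch (the fingerprints and function name are illustrative):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints:
    shared on-bits divided by the union of on-bits."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    if not on_a and not on_b:
        return 1.0  # convention: two all-zero fingerprints are identical
    return len(on_a & on_b) / len(on_a | on_b)

fp1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(tanimoto(fp1, fp2), 2))  # 0.6
```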

Substructure Key Based Fingerprints
As the name denotes, substructure key-based fingerprints build the bit string according to the presence of predefined keys. The fingerprint checks the substructure keys against the candidate compound and generates the bit string depending on the presence of the corresponding features. This is helpful when the list of structural keys is given, but not when the structural keys are absent. MACCS is considered one of the most commonly used fingerprint types. It is small in length and is widely used because it includes most of the features relevant to drug discovery. It is available in both 166-bit and 960-bit formats and includes structural keys expressed as SMARTS patterns (Reker and Schneider, 2015; Cereto-Massagué et al., 2015).

Topological Fingerprints
They enumerate all fragments of the molecule that arise at each stage of bond extension, continuing up to a certain number of bonds, and then apply a hash function to every enumerated fragment to generate the fingerprint. Because of this working manner, any candidate molecule can be converted into a fingerprint, and with the help of the hashing mechanism we can adjust the length of the bit string. These features make topological fingerprints fast for substructure searching and filtering of candidate molecules. One disadvantage of this fingerprint type is that a single bit cannot be traced back to a specific feature, which may lead to bit collisions. Atom Pairs and Topological Torsions are two fingerprints that come under the topological type. Atom Pairs comes in two versions: the hashed fingerprint has 2048 bits, and the ordinary one has over 16000 bits. Topological Torsion includes four different atom types in each fragment of the bit string, with the conversion range based on the path (Lavecchia, 2015; Willett, 2006).
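The hashing step described above can be sketched in a few lines: each enumerated fragment is hashed into a fixed-length bit string, which is also where bit collisions come from. The fragment strings and function name below are purely illustrative:

```python
def hashed_fingerprint(paths, n_bits=64):
    """Fold an arbitrary set of linear fragments (e.g. atom/bond
    paths) into a fixed-length bit string by hashing. Two distinct
    fragments may map to the same bit (a bit collision)."""
    bits = [0] * n_bits
    for path in paths:
        bits[hash(path) % n_bits] = 1
    return bits

# Hypothetical linear fragments enumerated from a small molecule
fragments = ["C-C", "C-C-O", "C-O", "C-C-C", "C=O"]
fp = hashed_fingerprint(fragments, n_bits=32)
print(len(fp), sum(fp))  # 32 bits total; at most 5 set, fewer on collision
```

Shrinking `n_bits` raises the collision rate, which is exactly the trade-off between fingerprint length and the ability to trace a bit back to a feature.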

Circular Fingerprints
Circular fingerprints are an extension of topological fingerprints. Instead of enumerating fragments along paths up to individual bonds, this fingerprint type evaluates the environment of each atom within a circle of a given radius. For this reason, it is not meant for substructure searching or substructure queries; it is mainly used for structure similarity searching. ECFP and FCFP are well-known types of circular fingerprints. ECFP (Extended Connectivity Fingerprint) is mainly based on the Morgan algorithm. There are two common variants, ECFP4 and ECFP6; the main difference is the size of the neighborhood considered, with one using a diameter of four bonds and the other six. FCFP (Functional Class Fingerprints) is a variation of ECFP: instead of indexing the environment of a particular atom, it indexes the atom's functional role in the compound (Kumar and Zhang, 2015; Duan et al., 2010).

Naive Bayes
The Naive Bayes algorithm is a classification model based on the well-known Bayes theorem.
The Naive Bayes classifier presumes that the presence of a specific feature in a class or category is independent of the presence of any other feature. Along with its simplicity, this classification model is easy to build and useful for sizeable data sets. Because of these factors, Naive Bayes can outperform even highly advanced classification techniques.
The posterior probability P(c|x) is found by dividing the product of the prior probability of the category P(c) and the likelihood P(x|c), the probability of the predictor given the category, by the prior probability of the predictor P(x):

P(c|x) = P(x|c) P(c) / P(x)

The Naive Bayes algorithm works as follows. First, we convert the input data set into a frequency table.
From the frequency table, the algorithm creates a likelihood table by computing the probabilities. By substituting these probabilities into the Bayes equation, the algorithm finds the posterior probability for each class; the predicted outcome is the category with the highest posterior probability. Advantages of the Naive Bayes classifier: compared to other classification methods like logistic regression, Naive Bayes is fast and easy to implement. It needs less training data, and it scales linearly with the number of data points and predictors. The classifier handles both discrete and continuous data efficiently, can make accurate probabilistic predictions, and is used for both binary and multi-class classification problems. Disadvantages of the Naive Bayes classifier: it depends heavily on its feature-independence assumption, which is also its most significant weakness, because in real-life scenarios it is hard to find a group of features that are entirely independent of each other. Furthermore, if a categorical variable has a class that was not observed in the training data, the classifier assigns it zero probability and becomes unable to make a prediction. This is generally known as the "zero frequency" problem in Naive Bayes classification (Sun, 2006; Vijayarani and Muthulakshmi, 2013).
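For binary fingerprint bits, the steps above can be sketched as a Bernoulli-style Naive Bayes with add-one (Laplace) smoothing, which is one standard way to sidestep the zero-frequency problem. The toy data and function names are illustrative only:

```python
import math

def train_naive_bayes(X, y):
    """Bernoulli Naive Bayes over binary fingerprint bits, with
    add-one (Laplace) smoothing to avoid the zero-frequency problem."""
    model = {}
    n = len(X)
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        # P(bit_i = 1 | class c), smoothed so no probability is 0 or 1
        p_bit = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(len(X[0]))]
        model[c] = (prior, p_bit)
    return model

def predict(model, x):
    """Pick the class with the highest log posterior probability."""
    best, best_logp = None, -math.inf
    for c, (prior, p_bit) in model.items():
        logp = math.log(prior) + sum(
            math.log(p if bit else 1 - p) for bit, p in zip(x, p_bit))
        if logp > best_logp:
            best, best_logp = c, logp
    return best

# Toy fingerprints: "drug" tends to set bit 0, "non-drug" bit 2
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = ["drug", "drug", "non-drug", "non-drug"]
model = train_naive_bayes(X, y)
print(predict(model, [1, 0, 0]))  # drug
```

Working in log space avoids numerical underflow when fingerprints have thousands of bits.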

K Nearest Neighbours
K Nearest Neighbours (KNN) is a simple supervised ML algorithm suggested for classification as well as regression problems, though in industry it is mainly relied on for classification. KNN is called a lazy algorithm because it has no dedicated training phase and uses all the input data at classification time. The algorithm assumes nothing about the underlying data, which is why it is also known as a non-parametric algorithm.
Working of the KNN algorithm: KNN predicts the value of a new data point based on feature similarity, i.e. on how closely the new point matches the points in the training data set. The algorithm works as follows. We feed in the training and test data, then choose the value of K, the number of nearest data points to consider; K can be any integer. For each point in the test data, the algorithm calculates the distance to every training point (for example with the Euclidean metric), sorts the training points in ascending order of distance, selects the top K rows of the sorted array, and finally assigns the test point the most frequent class among those rows.
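The steps above can be sketched in a few lines of Python (the toy data and function name are illustrative; the `Counter` vote implements the "most frequent class" step):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    # Distance to every training point, sorted ascending
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y))
    top_k = [label for _, label in dists[:k]]       # keep the k nearest
    return Counter(top_k).most_common(1)[0][0]      # most frequent class

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["non-drug"] * 3 + ["drug"] * 3
print(knn_predict(train_X, train_y, (5, 5.5), k=3))  # drug
```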
Advantages of the KNN algorithm: along with its simplicity, KNN is beneficial for non-linear data classification because it makes no assumptions about the data. We can use it for regression as well as classification problems with high precision.
Disadvantages of the KNN algorithm: prediction is slow when N is large, and the method is very sensitive to irrelevant features. As mentioned earlier, KNN stores all the training data, so the algorithm is computationally somewhat expensive and requires more memory than other models (Jia et al., 2020; Shen et al., 2003).

Random Forest
The Random Forest algorithm is an ensemble learning method generally suggested for classification and regression problems. Just as a forest is made up of a large number of trees, and more trees make the forest more robust, the algorithm generates decision trees from data samples and obtains a prediction from each tree; a voting process then selects the best solution. It is a supervised learning model that is more robust than a single decision tree because combining the outputs reduces overfitting. The Random Forest algorithm works as follows. First, it selects random samples from the input data set and builds a decision tree for every sample. It then collects the prediction from every decision tree, the trees vote on each predicted result, and finally the algorithm selects the most-voted output as the final prediction.
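The bootstrap-sample-then-vote loop above can be sketched with deliberately tiny "trees" (depth-1 stumps on a randomly chosen feature). A real random forest grows much deeper trees with entropy- or Gini-based splits, so treat this purely as an illustration of bagging and majority voting:

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_forest(X, y, n_trees=25, seed=0):
    """Toy random forest: each 'tree' is a depth-1 stump trained on a
    bootstrap sample, splitting one random feature at its sample mean."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        f = rng.randrange(len(X[0]))          # random feature choice
        t = sum(x[f] for x in Xs) / len(Xs)   # sample mean as split point
        left = [lbl for x, lbl in zip(Xs, ys) if x[f] <= t] or ys
        right = [lbl for x, lbl in zip(Xs, ys) if x[f] > t] or ys
        forest.append((f, t, majority(left), majority(right)))
    return forest

def forest_predict(forest, x):
    """Every stump votes; the most-voted label is the final output."""
    votes = [l if x[f] <= t else r for f, t, l, r in forest]
    return majority(votes)

X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["non-drug", "non-drug", "drug", "drug"]
forest = train_forest(X, y)
print(forest_predict(forest, [0.85, 0.9]))  # drug
```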
Advantages of the Random Forest algorithm: by combining the results of different decision trees, the algorithm overcomes the problem of overfitting. Compared to a single decision tree, a random forest has less variance and works well on large data samples. The algorithm is very flexible: it does not require data scaling and gives good accuracy even on unscaled input, and it can maintain good accuracy even when a large percentage of the input data is missing.

Disadvantages of the Random Forest algorithm: complexity is considered one of its most significant drawbacks. Compared to a regular decision tree, building a random forest is time-consuming and requires more computational resources. Prediction is also slow in comparison with other algorithms, and for an extensive collection of decision trees the prediction is less intuitive (Ani et al., 2018, 2016).

Multi-layer perceptron
The multi-layer perceptron (MLP) is a type of feed-forward artificial neural network. An MLP with a single hidden layer is sometimes known as a vanilla artificial neural network. An MLP contains at least three layers: an input layer, a hidden layer and an output layer. Every node except the input nodes is a neuron with an activation function. For training, the MLP uses the back-propagation algorithm. The noticeable differences from the linear perceptron are that the multi-layer perceptron has non-linear activation functions and multiple layers, so it can also distinguish data that are not linearly separable. How does a multi-layer perceptron work? Like the perceptron, the MLP has fully connected input and output layers; the difference is that the MLP contains one or more hidden layers between them (Yosipof et al., 2018).

Figure 1: UI of the In Silico Prediction Tool
Let's look at the algorithm of the multi-layer perceptron. The inputs are pushed forward through the network by taking the dot product of the inputs with the weights assigned between the input layer and the hidden layer; this is almost the same as the working of the perceptron. At the hidden layer, the algorithm computes this dot product but does not push the raw value forward: as mentioned, the MLP applies an activation function at every layer except the input layer. Sigmoid functions and Rectified Linear Units (ReLU) are examples of such activation functions. The computed result at the hidden layer is passed through the activation function and forwarded to the next layer, again carrying forward the dot product with the corresponding weights. The algorithm repeats these steps up to the final output layer, whose result is used for decision-making.
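The forward pass described above can be sketched directly: each layer takes the dot product of its inputs with each neuron's weights, adds a bias, and applies the activation. The hand-picked weights below form an illustrative 2-2-1 network that solves XOR, the classic problem a single-layer perceptron cannot:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: dot product of the inputs with each
    neuron's weights, plus a bias, pushed through the activation."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """Repeat the layer step from the first hidden layer to the output."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

# Hand-wired weights: two hidden units (OR-like and NAND-like),
# combined by an AND-like output unit -> XOR behaviour.
hidden = ([[20, 20], [-20, -20]], [-10, 30])
output = ([[20, 20]], [-30])
net = [hidden, output]
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(mlp_forward([a, b], net)[0]))  # 0, 1, 1, 0
```

In practice the weights are of course learned by back-propagation rather than set by hand; the sketch only shows the inference path.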
Advantages of the MLP: the multi-layer perceptron forms the basis for all neural networks, and it increases computational power when dealing with regression and classification problems. With a multi-layer perceptron we can also model large, complex problems, and the network is free from the XOR problem that limits the single-layer perceptron (Radhika et al., 2019; Kavitha et al., 2017).

Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a kind of neural network generally used for image or visual data identification and classification. They are also used effectively in areas like commercial recommendation applications and natural language processing. There are different types of CNNs, such as 1D, 2D and 3D. Regardless of type, the algorithm has the same traits and works on the same approach; the difference lies in the type of input data. For example, 1D CNNs are used for natural language processing and 2D CNNs for image classification; the inputs are typically a sentence and the pixels of an image, respectively. The working of the filters also distinguishes these three types. Regarding the architecture of convolutional neural networks, shared weights and sparse connectivity are two essential features. Compared to the multi-layer perceptron, the CNN has a different connectivity pattern between neurons: we can set a width for the connections. Assume the width is three; then each neuron in layer m connects to three neurons in the layer below, and this continues to the last layer. Neurons in the same layer can share common boundaries, and we can group them; a group that shares common boundaries is called a filter. In the m-1 layer we have five neurons and three different filters. Through these filters we can generate better results for an input data set. A non-linear function is used with the filters; for our tool, we applied the sigmoid function for the non-linearities. The most significant advantage of this architectural pattern is that we can find connections between neurons in neighbouring layers (Chen et al., 2018). Shared weights are another feature of the CNN, meaning that the same parameters are shared within each filter, as between the m and m-1 layers mentioned above.
From the figure, it is clear that each neuron in the m-1 layer shares the same weights with the neurons in the layer m above. Using filters like this undoubtedly helps in identifying features regardless of their location in the input data set. Shared weights also reduce the number of parameters, which increases the efficiency of the algorithm and speeds up processing. CNNs also use ideas like max pooling and average pooling to reduce overfitting on the input data set (Wallach et al., 2015; Li et al., 2017).
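The pooling idea mentioned above can be sketched in one function: non-overlapping windows keep only their strongest response, shrinking the feature map (the values are illustrative; average pooling would replace `max` with the window mean):

```python
def max_pool_1d(signal, window=2):
    """Non-overlapping 1-D max pooling: keep the strongest response in
    each window, shrinking the feature map and making the network less
    sensitive to the exact position of a feature."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]

feature_map = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4]
print(max_pool_1d(feature_map))  # [0.9, 0.3, 0.8]
```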

Database
The dataset sources from which data were taken for training and testing are given in Table 7.

Dataset
The training and test data set contains a total of 11449 compounds, of which 6449 drug compounds were taken from the ChEMBL database and 5000 non-drug compounds were taken from the NCI database (Roy and Kadam, 2007).

Tool Operation
The tool's front-end is built on the Anvil platform, and the back-end runs in a Jupyter notebook based on Python 3. Anvil is a cloud-based platform for the easy creation of web apps. The tool takes input in SMILES format and gives the user the choice of drug-likeness prediction by two methods, descriptor prediction and fingerprint-based prediction. The descriptor prediction uses six descriptor rules. The tool also allows the user to choose which fingerprint type to use in the fingerprint-based prediction; five fingerprint techniques are supported: MACCS-166, RDK, Atom Pairs, Topological Torsion and Morgan fingerprints. As analyzed earlier, the tool uses the Random Forest algorithm for prediction. The UI of our tool is shown in Figure 1.
The descriptor output shows pass or fail results, and the fingerprint output shows whether the compound can become a drug or not. Screenshots of the tool predicting a drug and a non-drug are shown in Figures 2 and 3.

RESULTS AND DISCUSSION
The result set contains the accuracy of the various fingerprints under five different algorithms. The analysis of the basic MACCS-166-bit fingerprint is shown in Table 9. The RDK fingerprint provided by the RDKit package was also analyzed, with results given in Table 10. Tables 8 and 12 contain the result sets of the two topological fingerprints, Atom Pairs and Topological Torsion, respectively. The only circular fingerprint taken for analysis is the Morgan fingerprint, whose result set is shown in Table 11.

CONCLUSIONS
From the analysis, we found that the Random Forest algorithm gave better results with all the fingerprints used for prediction; it outperformed the other algorithms in similarity-based drug-likeness prediction. So we implemented the Random Forest algorithm in our tool for drug-likeness prediction across the five fingerprint types. Our tool also shows results based on descriptor prediction, which uses a total of six rules. By incorporating both fingerprint and descriptor prediction, the tool gives better results than other available tools.