In-silico Discovery and Simulated Selection of Multi-target Anti-HIV-1 Inhibitors

The multi-target quantitative structure-activity relationship (mt-QSAR) study of human immunodeficiency virus type 1 (HIV-1) inhibitors was addressed by applying a simple yet effective linear regression model based on the genetic function approximation (GFA). QSAR studies were performed on two datasets of HIV-1 inhibitors targeting integrase and reverse transcriptase, respectively. By using the GFA method, the collaboration among the different sets of inhibitors was exploited and an efficient multi-target QSAR model for HIV-1 inhibitors was obtained. The predictive quality of the mt-QSAR models was tested on an external set of 30 compounds, randomly chosen out of 150 compounds. A linear regression model based on GFA with eight selected descriptors was obtained. The accuracy of the proposed model is illustrated using the following evaluation techniques: cross-validation, validation through an external test set, applicability domain, and Y-randomization. We accordingly propose a quantitative model, and we interpret the activity of the compounds relying on multivariate statistical analysis. The prediction results demonstrate that the predictive capacity of the model is satisfactory and that it can be used for designing similar groups of anti-HIV compounds.

Original Research Article. Edache et al.; IRJPAC, 11(1): 1-15, 2016; Article no. IRJPAC.22863


INTRODUCTION
The management of acquired immunodeficiency syndrome (AIDS) is among the most challenging medical problems worldwide. So far, there is no realistic cure for HIV/AIDS. Highly Active Antiretroviral Therapy (HAART) is recommended for the treatment of HIV [1]. HAART is an aggressive treatment in which a combination of different antiviral drugs is used to suppress HIV replication and the progression of the disease [2]. Most of the current strategies for treating AIDS depend on inhibiting the HIV-1 reverse transcriptase enzyme. Multi-drug resistance is one of the major immediate threats to human health today [3]. Trends in the incidence of HIV, together with the development of multi-drug-resistant and extensively drug-resistant strains of HIV, raise the need to intensify the search for more efficient drugs to combat this disease. The majority of existing therapies target viral replication at the reverse transcriptase (RT), integrase and protease enzymes [4,5]. However, the emergence of drug resistance has been observed [6]; therefore, new therapeutic agents are still needed. Recently, a new class of therapeutic agents, such as T-20, has focused on inhibiting HIV entry into cells (CD4 binding, co-receptor binding and membrane fusion) [7].
The multi-target drug design method is a promising way to complement the current single-target process, and a wealth of studies address the problem of target prediction [8] and multi-target structure-activity models [9,10]. Multi-target drug prediction is a current research topic in the field of drug design. Despite the positive results of the studies mentioned above, the models considered were still trained for each target separately. In this study, the multi-target QSAR study of HIV-1 inhibitors was addressed by applying a simple yet effective linear regression model based on genetic function approximation (GFA), which was recently presented in the machine learning community. QSAR studies were performed on two datasets of HIV-1 inhibitors targeting integrase and reverse transcriptase. By using the GFA method, the collaboration among the different sets of inhibitors was exploited and an efficient multi-target QSAR model for HIV-1 inhibitors was obtained. The general descriptor features and drug-like features used for compound description were ranked according to their joint importance in multi-target [11,12] QSAR modeling, which will offer useful hints for the design of novel multi-target HIV-1 inhibitors with an increased likelihood of successful HIV therapies.
Computer-aided drug design techniques may play a very important role. These techniques are based on multi-target Quantitative Structure-Activity Relationship (mt-QSAR) studies, that is, models linking the structure of drugs with their biological activity against different targets [13]. This kind of study may also be useful in the multi-objective optimization of desired properties or activities of drugs against different targets. More than 5000 descriptors are available that may be used to solve these problems [14]. The QSAR studies reported to date are based on descriptors and databases of structurally related compounds relevant to only one viral species. Consequently, a researcher interested in predicting, for example, the antiviral activity of a given series of compounds has to develop as many QSAR equations as there are combinations of compound families and viral species to be predicted. Therefore, the development of a single unified equation explaining the antiviral activity of structurally heterogeneous series of compounds against as many viral species as possible is of major interest [15]. In fact, other mt-QSAR approaches of demonstrated usefulness have recently been introduced in medicinal chemistry [16]. The results of this study will help to substantiate the claims of QSAR experts and will also enrich the database on 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine, indole β-diketo acid, diketo acid and carboxamide derivatives with anti-HIV-1 activity, which can be used in drug discovery alongside the development of rational QSAR tools for decision support in anti-HIV therapy.

MATERIALS AND METHODS
Our study was performed on two kinds of HIV target datasets compiled from an extensive literature review, which consisted of inhibitors with their binding affinities for HIV integrase and reverse transcriptase. These inhibitors are referred to, respectively, as integrase inhibitors, which prevent the proviral DNA from inserting into the host cell genome, and non-nucleoside reverse transcriptase inhibitors (NNRTIs), which inhibit the virus by preventing the copying of its genomic RNA into proviral DNA for incorporation into the host cell DNA. A dataset containing 150 compounds with well-defined activity [17,18] was selected for the QSAR study. The biological activity data, in the form of IC50 and EC50 values (molar concentration of the drug leading to 50% inhibition of the enzyme) in μM (micromolar), were converted into the negative logarithm of the molar concentration (pIC50) for the mt-QSAR analysis (Table 1).
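The unit conversion described above can be sketched as follows. This is a minimal illustration, assuming the activities are tabulated in micromolar; the function name and the example values are hypothetical and are not taken from Table 1.

```python
import math

def pic50_from_micromolar(ic50_um: float) -> float:
    """Convert an IC50 (or EC50) given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_um * 1e-6)  # 1 uM = 1e-6 mol/L

# Hypothetical example values, for illustration only
for ic50 in (0.05, 1.0, 25.0):
    print(f"IC50 = {ic50:g} uM -> pIC50 = {pic50_from_micromolar(ic50):.2f}")
```

On this scale a more potent compound (lower IC50) has a higher pIC50, which is why the negative logarithm is the usual response variable in QSAR regression.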

Molecular Modeling and Generation of Molecular Descriptors
A dual-core personal computer running the Windows 7 operating system was used for the calculations in this work. The structures of all the compounds were drawn using the ChemDraw Ultra module of the program and transferred to Spartan'14 (2013) version 1.1.2 [19] to create the three-dimensional (3D) structures. These structures were then subjected to energy minimization using molecular mechanics (MMFF). The energy-minimized molecules were optimized with the DFT (density functional theory) method using the B3LYP functional [20] and the 6-311G* basis set [21]. These methods have become popular in recent years because they can reach a precision similar to that of other methods in less time and at lower computational cost. The geometry optimization of the lowest-energy structure was carried out without any symmetry constraints. The optimized structures were also transferred to PaDEL-Descriptor [22] version 2.18 and subjected to re-optimization (with the MMFF94 force field). The most stable structure of each compound was generated and used for calculating the various physicochemical parameters used in the statistical analysis.

Variable Selection and Model Generation
Even though many molecular descriptors are available, only a subset of them is statistically significant in terms of correlation with biological activity. Therefore, the variable selection method is very important for deriving the best QSAR model. The GFA [23] approach was adopted to select the best possible variables as well as to generate the QSAR models.
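The idea of evolving descriptor subsets can be illustrated with a deliberately simplified genetic search. This is a sketch under stated assumptions, not the GFA implementation used in this study: it scores fixed-size subsets by the adjusted R^2 of an ordinary least-squares fit, whereas GFA as published scores candidate equations with a lack-of-fit measure. All names, population settings and the synthetic data are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y, subset):
    """Adjusted R^2 of an ordinary least-squares fit on the chosen descriptor columns."""
    n, p = len(y), len(subset)
    A = np.column_stack([np.ones(n), X[:, sorted(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

def genetic_selection(X, y, n_select, pop_size=30, generations=40, seed=0):
    """Evolve fixed-size descriptor subsets; return the best subset found and its score."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    pop = [frozenset(rng.choice(n_desc, n_select, replace=False).tolist())
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda s: adjusted_r2(X, y, s), reverse=True)
        parents = ranked[: pop_size // 2]            # keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), 2, replace=False)
            pool = sorted(parents[i] | parents[j])   # crossover: pool both parents' descriptors
            child = set(rng.choice(pool, n_select, replace=False).tolist())
            if rng.random() < 0.3:                   # mutation: swap one descriptor at random
                child.remove(int(rng.choice(sorted(child))))
                outside = [d for d in range(n_desc) if d not in child]
                child.add(int(rng.choice(outside)))
            children.append(frozenset(child))
        pop = parents + children
    best = max(pop, key=lambda s: adjusted_r2(X, y, s))
    return sorted(best), adjusted_r2(X, y, best)

# Synthetic stand-in data: only descriptors 1 and 4 carry signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 10))
y_demo = 2.0 * X_demo[:, 1] - 3.0 * X_demo[:, 4] + rng.normal(scale=0.1, size=60)
selected, score = genetic_selection(X_demo, y_demo, n_select=2)
print("selected descriptor columns:", selected, "adjusted R2:", round(score, 3))
```

Keeping the fitter half of each generation means a good subset, once found, is never lost; the mutation step keeps descriptors outside the current population reachable.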

Genetic function approximation method
The GFA [23] approach is a search method for finding approximate solutions to optimization and search problems. GFA is conceived from

Validation of the QSAR Model
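The predictive capability of the QSAR equation was determined using the leave-one-out cross-validation method. The cross-validation regression coefficient ($Q^2_{cv}$) was calculated by the following equation:

$$Q^2_{cv} = 1 - \frac{\sum_{i} \left(Y_i - \hat{Y}_{i(\mathrm{LOO})}\right)^2}{\sum_{i} \left(Y_i - \bar{Y}\right)^2}$$

where $Y_i$ is the experimental activity of training set compound $i$, $\hat{Y}_{i(\mathrm{LOO})}$ is its activity predicted by the model built without that compound, and $\bar{Y}$ is the mean experimental activity of the training set.

$R^2_{adj}$ is interpreted similarly to the $R^2$ value, but also takes the number of degrees of freedom into account. It is adjusted by dividing the residual sum of squares and the total sum of squares by their respective degrees of freedom. The $R^2_{adj}$ value diminishes if a variable added to the equation does not reduce the unexplained variance [24]. Consequently, $R^2_{adj}$ is used to compare models with different numbers of predictor variables.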
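The two internal statistics described above can be sketched as follows. The sketch assumes an ordinary least-squares model with an intercept and takes the degrees of freedom as N - P - 1 (residual) and N - 1 (total) for N compounds and P descriptors; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bP]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def q2_loo(X, y):
    """Leave-one-out Q^2 = 1 - PRESS / TSS, refitting the model N times."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # drop compound i
        coef = fit_ols(X[keep], y[keep])
        press += float((y[i] - predict(coef, X[i:i + 1])[0]) ** 2)
    return 1.0 - press / float(np.sum((y - y.mean()) ** 2))

def r2_adjusted(X, y):
    """R^2 with RSS and TSS each divided by its degrees of freedom."""
    n, p = X.shape
    resid = y - predict(fit_ols(X, y), X)
    rss = float(resid @ resid)
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

# Synthetic stand-in for a descriptor matrix and pIC50 vector
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = 6.0 + X_demo @ np.array([0.8, -1.2, 0.5]) + rng.normal(scale=0.1, size=40)
print(f"Q2(LOO) = {q2_loo(X_demo, y_demo):.3f}, R2(adj) = {r2_adjusted(X_demo, y_demo):.3f}")
```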
A large F indicates that the model fit is not a chance occurrence. It has been shown that high values of these statistical characteristics are not by themselves proof of a highly predictive model [25,26]. Hence, to evaluate the predictive ability of our QSAR model, we used the methods described by Golbraikh and Tropsha [25] and Roy and Roy [26]. The values of the correlation coefficient between predicted and actual activities and of the correlation coefficient for regressions through the origin (predicted vs. actual activities and vice versa) were calculated using the Regression option of the Excel Analysis ToolPak, and other parameters were calculated as reported by the above authors [25,26]. The determination coefficient in prediction, $Q^2_{test}$, was calculated using the following equation [26]:

$$Q^2_{test} = 1 - \frac{\sum \left(Y_{pred(test)} - Y_{test}\right)^2}{\sum \left(Y_{test} - \bar{Y}_{tr}\right)^2}$$

where $Y_{pred(test)}$ and $Y_{test}$ are the predicted value based on the QSAR equation (model response) and the experimental activity value, respectively, of the external test set compounds, and $\bar{Y}_{tr}$ is the mean activity value of the training set compounds.

The quality factor (Q) is calculated as:

$$Q = \frac{R}{SEE}$$

where R is the correlation coefficient and SEE is the standard error of estimate. Overfitting and chance correlation due to an excessive number of predictor variables can be detected by the Q value [27,28]. A positive Q value for this QSAR model suggests high predictive power and a lack of overfitting [29].
Further evaluation of the predictive ability of the QSAR model for the external test set compounds was done by determining the value of $r^2_m$ by the following equation [26]:

$$r^2_m = r^2 \left(1 - \sqrt{\left| r^2 - r_0^2 \right|}\right)$$

where $r^2$ is the squared correlation coefficient between the experimental and predicted values of the test set compounds, and $r_0^2$ is the squared correlation coefficient between the same values with the intercept set to zero.
The value of $r^2_{m(test)}$ should be greater than 0.5 for an acceptable model. The concept of $r^2_m$ is not only applicable to test set prediction; it can also be applied to the training set if one considers the correlation between the observed and leave-one-out predicted values of the training set compounds [26]. Moreover, it can be used for the whole set, considering leave-one-out predicted values for the training set and predicted values for the test set compounds [23]. The $r^2_{m(overall)}$ statistic may be used to select the best predictive models from among comparable models. The values of k and k′, the slopes of the regression lines through the origin of the predicted activity versus the actual activity and vice versa, were calculated using the following equations [25]:

$$k = \frac{\sum Y_i \tilde{Y}_i}{\sum \tilde{Y}_i^2}, \qquad k' = \frac{\sum Y_i \tilde{Y}_i}{\sum Y_i^2}$$

where $Y_i$ and $\tilde{Y}_i$ are the predicted and experimental activities, respectively.
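The external validation statistics discussed above can be gathered in one short routine. This is a sketch under assumptions: the zero-intercept coefficient is computed here from the regression of experimental on predicted values through the origin, which is one of the two possible axis conventions, and the function name and example values are hypothetical.

```python
import math

def external_validation(y_exp, y_pred, y_train_mean):
    """Return k, k', r2, r0^2, r2m and the prediction coefficient for a test set."""
    n = len(y_exp)
    sxy = sum(e * p for e, p in zip(y_exp, y_pred))
    k = sxy / sum(e * e for e in y_exp)          # predicted ~ k * experimental
    k_prime = sxy / sum(p * p for p in y_pred)   # experimental ~ k' * predicted

    mean_e = sum(y_exp) / n
    mean_p = sum(y_pred) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(y_exp, y_pred))
    tss_e = sum((e - mean_e) ** 2 for e in y_exp)
    tss_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov ** 2 / (tss_e * tss_p)              # squared Pearson correlation

    # Zero-intercept coefficient (assumed convention: experimental on predicted)
    r02 = 1.0 - sum((e - k_prime * p) ** 2 for e, p in zip(y_exp, y_pred)) / tss_e
    r2m = r2 * (1.0 - math.sqrt(abs(r2 - r02)))

    press = sum((p - e) ** 2 for e, p in zip(y_exp, y_pred))
    q2_test = 1.0 - press / sum((e - y_train_mean) ** 2 for e in y_exp)
    return {"k": k, "k_prime": k_prime, "r2": r2, "r02": r02,
            "r2m": r2m, "q2_test": q2_test}

# Hypothetical test set values, for illustration only
stats = external_validation([5.0, 6.0, 7.0, 8.0], [5.1, 5.9, 7.2, 7.8], y_train_mean=6.5)
print({name: round(value, 4) for name, value in stats.items()})
```

A slope k or k′ close to 1 together with a small gap between the two correlation coefficients indicates that predictions track the experimental values without a systematic offset.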
The statistical significance of the relationship between activity and the descriptors was further checked by a randomization test (Y-randomization) of the models. The Y column entries were scrambled and new QSAR models were developed using the same set of variables as in the un-randomized model. We used the parameter $R^2_p$ [30]:

$$R^2_p = R^2 \sqrt{R^2 - R^2_r}$$

where $R^2$ is the squared correlation coefficient of the un-randomized model and $R^2_r$ is the mean squared correlation coefficient of the randomized models. Throughout, N is the total number of compounds and P is the number of predictor variables.
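A Y-randomization run can be sketched as below. The sketch assumes an ordinary least-squares model, takes the randomized coefficient as the mean over all scrambled models, and clips a negative difference to zero before taking the square root; the function names, the number of trials and the synthetic data are hypothetical.

```python
import numpy as np

def r2_fit(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

def y_randomization(X, y, n_trials=100, seed=0):
    """Scramble y repeatedly, refit with the same descriptors, and compute R2p."""
    rng = np.random.default_rng(seed)
    r2 = r2_fit(X, y)
    r2_random = [r2_fit(X, rng.permutation(y)) for _ in range(n_trials)]
    r2_r = float(np.mean(r2_random))
    r2_p = r2 * float(np.sqrt(max(r2 - r2_r, 0.0)))
    return r2, r2_r, r2_p

# Synthetic stand-in for a descriptor matrix and activity vector
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(40, 3))
y_demo = 6.0 + X_demo @ np.array([0.8, -1.2, 0.5]) + rng.normal(scale=0.1, size=40)
r2, r2_r, r2_p = y_randomization(X_demo, y_demo)
print(f"R2 = {r2:.3f}, mean randomized R2 = {r2_r:.3f}, R2p = {r2_p:.3f}")
```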
To check the intercorrelation of descriptors, variance inflation factor (VIF) analysis was performed. The VIF value is calculated from:

$$VIF = \frac{1}{1 - R^2}$$

where $R^2$ is the squared multiple correlation coefficient obtained when one descriptor is regressed on the remaining molecular descriptors. If the VIF value is larger than 10, the information carried by the descriptors can be hidden by the correlation among them [31,32].

The values of $R^2_r$ and $R^2$ were determined and then used for calculating the value of $R^2_p$. Models with $R^2_p$ values greater than 0.5 are considered statistically robust. If the value of $R^2_p$ is less than 0.5, it may be concluded that the outcome of the model is merely due to chance and that the model is not at all predictive for truly external data sets. In this data set, the values of $R^2_p$ for all 100 models were well above the stipulated value of 0.5 (Table 2). Therefore, it can be concluded that, besides being robust, the model developed is well predictive.
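The VIF check on descriptor intercorrelation can be sketched as follows: each descriptor is regressed on the remaining ones and the resulting R^2 is converted to a VIF. The function name and the synthetic descriptor matrices are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, regressed on the other columns."""
    n, m = X.shape
    factors = []
    for j in range(m):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(np.sum((target - target.mean()) ** 2))
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic descriptors: three independent columns, then a near-copy of column 0
rng = np.random.default_rng(3)
X_indep = rng.normal(size=(200, 3))
X_coll = np.column_stack([X_indep, X_indep[:, 0] + 0.05 * rng.normal(size=200)])
print("independent descriptors:", np.round(vif(X_indep), 2))
print("with a collinear column:", np.round(vif(X_coll), 1))
```

In the second matrix, the duplicated descriptor and its near-copy both exceed the threshold of 10, while the unrelated columns stay close to 1.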

Fig. 7. The calculated pIC50 versus the experimental pIC50 for the training set
The inter-correlation of the descriptors used in the QSAR model was very low (below 0.8), which is in line with the requirement that, for a statistically significant model, the descriptors involved in the equation should not be inter-correlated with each other [34]. To further check the intercorrelation of descriptors, VIF analysis was performed. In this model, the VIF values of the descriptors (Table 3) are less than the threshold value of 10 [31,32]. Satisfied with the robustness of the QSAR model developed using the training set, we applied the QSAR model to an external data set constituting the test set. As the experimental IC50 values for these inhibitors are already available, this set of molecules provides an excellent data set for testing the predictive power of the QSAR model for new compounds. Table 1 presents the predicted pIC50 values of the test set based on model (1). The overall root mean square error (RMSE) between the experimental and predicted pIC50 values was 0.6955, which reveals good predictability.
The estimated correlation coefficients between experimental and predicted pIC50 values with intercept (r 2 testo ) and without intercept (r … and 0.9562, which are well within the specified range of 0.85 to 1.15 [25]. The values of $r^2_{m(LOO)}$ = 0.8899, $R^2_{pred}$ = 0.9261, $r^2_{m(test)}$ = 0.8381, and $r^2_{m(overall)}$ = 0.8919 were found to be in the acceptable range [26], thereby indicating the good external predictability of the QSAR model.
The Williams plot, the plot of the standardized residuals versus the leverage, was used to visualize the applicability domain (AD) [35,36]. Leverage indicates a compound's distance from the centroid of X. The leverage of a compound in the original variable space is defined as:

$$h_i = x_i^T \left(X^T X\right)^{-1} x_i$$

where $x_i$ is the descriptor vector of the considered compound and X is the descriptor matrix derived from the training set descriptor values. The warning leverage (h*) is defined as:

$$h^* = \frac{3(p + 1)}{N}$$

where N is the number of training compounds and p is the number of predictor variables. From the Williams plot (Figs. 9 and 10), all the test set compounds are inside the domain of the training set [37].

To examine the relative importance, as well as the contribution, of each descriptor in the model, the value of the mean effect (MF) [37,38] was calculated for each descriptor using the following equation:

$$MF_j = \frac{\beta_j \sum_{i=1}^{n} d_{ij}}{\sum_{j=1}^{m} \beta_j \sum_{i=1}^{n} d_{ij}}$$

where $MF_j$ represents the mean effect for the considered descriptor j, $\beta_j$ is the coefficient of descriptor j, $d_{ij}$ stands for the value of descriptor j for molecule i, n is the number of molecules, and m is the number of descriptors in the model. The MF value indicates the relative importance of a descriptor compared with the other descriptors in the model. Its sign (+, -) indicates the direction of variation in the activity values as a result of an increase or decrease in the descriptor values. The mean effect values are shown in Table 3. Mean effects were calculated for all descriptors. The activity is assumed to be highly dependent upon ATSc3, SCH3 and VPC-5.
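The leverage, warning leverage and mean effect calculations can be sketched together. The sketch assumes that X is the plain training descriptor matrix, as in the definition above, without an added intercept column; the function names and the synthetic data are hypothetical. The only figures taken from this study are the 120 training compounds (150 minus the 30 test compounds) and the eight descriptors used in the last line.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i for each row of X_query, using the training matrix."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def warning_leverage(n_train, n_predictors):
    """h* = 3 (p + 1) / N."""
    return 3.0 * (n_predictors + 1) / n_train

def mean_effects(X, beta):
    """MF_j = beta_j * sum_i d_ij, normalised by the sum of that product over all descriptors."""
    contrib = np.asarray(beta, dtype=float) * np.asarray(X, dtype=float).sum(axis=0)
    return contrib / contrib.sum()

# Synthetic stand-in descriptor matrices
rng = np.random.default_rng(5)
X_train = rng.normal(size=(50, 4))
X_test = rng.normal(size=(10, 4))
h_star = warning_leverage(n_train=50, n_predictors=4)
inside = leverages(X_train, X_test) < h_star
print("h* =", h_star, "| test compounds inside the domain:", int(inside.sum()), "of", len(inside))
print("h* for 120 training compounds and 8 descriptors:", warning_leverage(120, 8))
```

A compound whose leverage exceeds h* lies far from the training set centroid, so its prediction is an extrapolation and should be treated with caution.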
In the model, a Student's t-test was performed at a confidence level of 95% to confirm the significance of each descriptor. All the p-values (Fig. 3) of the descriptors were less than 0.05, indicating that the selected descriptors were statistically significant at the 95% level.

CONCLUSION
In this article, a QSAR study of 150 molecules showing HIV-1 inhibitory activity was performed based on theoretical molecular descriptors calculated with the PaDEL-Descriptor software. The model built was assessed comprehensively (internal and external validation), and all the validations indicated that the QSAR model was robust and satisfactory and that the selected descriptors could account for the structural features responsible for HIV-1 inhibitory activity. The QSAR model developed in this study can provide a useful tool to predict the activity of new compounds and to design new compounds with high anti-HIV-1 inhibitory activity.

[Figure: normalized mean distance versus experimental pIC50 for the training and test sets]