A Derived QSAR Model for Predicting Some Compounds as Potent Antagonist against Mycobacterium tuberculosis: A Theoretical Approach

The emergence of multidrug-resistant strains of M. tuberculosis has created the need for more potent antituberculosis agents. Novel compounds are usually synthesized by trial and error, which is time-consuming and expensive. QSAR is a theoretical approach that has the potential to reduce this burden in the discovery of new potent drugs against M. tuberculosis. In this work, a multivariate QSAR model was developed to correlate the chemical structures of 2,4-disubstituted quinoline analogues with their observed activities. To build a robust QSAR model, Genetic Function Approximation (GFA) was employed to select the descriptors that best predict the activities of the inhibitory agents. The developed model was influenced by the molecular descriptors AATS5e, VR1_Dzs, SpMin7_Bhe, TDB9e, and RDF110s. Internal validation of the derived model gave a correlation coefficient ($R^2$) of 0.9265, an adjusted correlation coefficient ($R^2_{adj}$) of 0.9045, and a leave-one-out cross-validation coefficient ($Q^2_{cv}$) of 0.8512, while external validation gave an $R^2_{test}$ of 0.8034 and a Y-randomization coefficient ($cR_p^2$) of 0.6633. The proposed QSAR model provides a valuable approach for modification of the lead compound and for the design and synthesis of more potent antitubercular agents.


Introduction
Tuberculosis (TB) is one of the deadliest bacterial diseases and is caused by the bacterium Mycobacterium tuberculosis. For 2013, the World Health Organization (WHO) estimated 9.0 million people with tuberculosis and 1.5 million deaths, including 360,000 among people who were HIV positive [1]. At present, pyrazinamide (PZA), para-aminosalicylic acid (PAS), isoniazid (INH), and rifampicin (RMP) are the drugs administered to patients suffering from tuberculosis.
The resistance of M. tuberculosis to the current drugs has led to the development of a new approach that is fast and precise and can predict the biological activity of new compounds against M. tuberculosis.
The quantitative structure-activity relationship (QSAR) approach is one of the most widely used computational methods for designing drugs and predicting their activities [2]. A QSAR model is a mathematical equation, linear in this work, that relates the molecular structures of compounds to their biological activities. In this research, a data set of 2,4-disubstituted quinoline derivatives that had been synthesized and evaluated against Mycobacterium tuberculosis [3] was selected for QSAR study. A few researchers [4][5][6][7] have used the QSAR approach to relate structure and activity for antitubercular inhibitors such as quinolone, chalcone, pyrrole, and 7-methyljuglone. However, no QSAR study has yet related the structures and activities of 2,4-disubstituted quinoline derivatives as potent antitubercular agents. Therefore, this study aimed to establish a valid QSAR model that correlates the structures of 2,4-disubstituted quinoline derivatives with their activities against Mycobacterium tuberculosis and predicts those activities.

Material and Method
Data Set and Data Collection. The 2,4-disubstituted quinoline derivatives with activity against Mycobacterium tuberculosis that were used in this research were selected from the literature [3]. The chemical structures of these compounds, together with their biological activities, are presented in Table 1, while equation (1) was used to convert the percentage activities to logarithmic units (see [5]).
Structure Optimization. In order for the molecules to attain a stable conformer at minimal energy, all the molecules were geometrically optimized with Spartan 14 V1.1.4, first with a Molecular Mechanics Force Field (MMFF) calculation to remove strain energy and then with Density Functional Theory (DFT) using the B3LYP functional [5].
Molecular Descriptor Calculation. A molecular descriptor is a numerical value that encodes a property of a molecule and is used to relate the structure of the compound to its biological activity. Descriptors for all the inhibitory compounds were calculated using PaDEL-Descriptor software V2.20.
Normalization of Data and Pretreatment. The values of the calculated descriptors were normalized using (2) so that each variable has the same opportunity at the outset to influence the model [8]:

$$Y = \frac{Y_1 - Y_{\min}}{Y_{\max} - Y_{\min}} \qquad (2)$$

where $Y_1$ is the descriptor value for each molecule and $Y_{\min}$ and $Y_{\max}$ are the minimum and maximum values of each descriptor column of Y. After normalization, the data were further subjected to pretreatment in order to remove noise and redundant data.
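The min-max scaling described above can be sketched in a few lines of Python. This is an illustration only: the function name and the example values are hypothetical and do not come from the study's descriptor table, and mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
def min_max_normalize(column):
    """Scale one descriptor column to the range [0, 1]."""
    y_min, y_max = min(column), max(column)
    if y_max == y_min:
        # Assumption: a constant descriptor is mapped to zeros
        # (such columns would normally be dropped during pretreatment).
        return [0.0 for _ in column]
    return [(y - y_min) / (y_max - y_min) for y in column]

# Hypothetical descriptor values for four molecules (not from the paper).
example_column = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(example_column)
print(normalized)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so descriptors with large numerical ranges do not dominate the regression.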
Data Division into Training and Test Set. The Kennard-Stone algorithm was employed in this study to divide the data set into two subsets, a training set and a test set, in a proportion of 70% to 30%. The training set was used to develop the QSAR model, while the test set was used to validate the developed model [9].
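A minimal sketch of a Kennard-Stone split is given below, assuming NumPy and Euclidean distance in descriptor space. The function name and the one-dimensional toy data are illustrative; the software and distance settings actually used in the study are not stated in the text.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return (train_indices, test_indices) chosen by the Kennard-Stone rule.

    Assumes n_train >= 2 and Euclidean distance between descriptor rows.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all compounds.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Seed the training set with the two most distant compounds.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each candidate to its nearest selected compound.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the candidate that lies farthest from the selected set.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Toy data: four compounds described by a single descriptor.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
train_idx, test_idx = kennard_stone(X_toy, n_train=3)
print(train_idx, test_idx)
```

Because the rule always adds the compound farthest from those already chosen, the training set spans the descriptor space and the test compounds fall inside it.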
Development of the Model. Multilinear regression (MLR) is the strategy used to develop the QSAR model. The MLR approach expresses a direct relationship between the dependent variable Y (activity) and the independent variables X (descriptors). In MLR analysis, the mean of the dependent variable Y depends on X. The MLR equation below incorporates more than one independent variable (descriptor) with a single response variable (activity):

$$Y = k_1 x_1 + k_2 x_2 + k_3 x_3 + \cdots + C$$

where Y represents the dependent variable, the $x_i$ represent the independent variables, the $k_i$ are the regression coefficients for each $x_i$, and C is the regression intercept [9].
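The MLR fit described above can be illustrated with an ordinary least-squares sketch, assuming NumPy. The function names and the synthetic data are hypothetical; the study itself used Material Studio, whose exact definition of the adjusted coefficient may differ from the textbook formula used here.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of Y = k1*x1 + ... + kn*xn + C; returns (k, C)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last coefficient is the intercept C.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def r2_and_adjusted(y, y_pred, n_descriptors):
    """Return (R^2, adjusted R^2) using the textbook definitions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_descriptors - 1)
    return r2, r2_adj

# Synthetic data generated from Y = 2*x1 - 1*x2 + 0.5 (not from the paper).
X_demo = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + 0.5
k, C = fit_mlr(X_demo, y_demo)
r2, r2_adj = r2_and_adjusted(y_demo, X_demo @ k + C, n_descriptors=2)
print(k, C, r2, r2_adj)
```

On noise-free synthetic data the fit recovers the generating coefficients, which is a quick way to check that the regression code is wired correctly before applying it to real descriptors.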
Generation of QSAR Model and Validation. The combinations of the optimum descriptors for the training set were obtained from the descriptor pool using the Genetic Function Approximation technique. The antitubercular activities were placed in the last column of the respective spreadsheets in Microsoft Excel 2010, which were then imported into Material Studio version 8.0 to generate the QSAR model by the multilinear regression (MLR) approach and to evaluate the internal validation parameters [9].
Determination of Outlier and Influential Molecule (Applicability Domain). The applicability domain approach was employed to identify outliers and influential molecules. Any compound whose standardized residual lies outside the applicability domain limit of ±3 is said to be an outlier. To define and describe the applicability domain of the built QSAR models, the leverage $h_i$ approach was employed, defined as follows [10]:

$$h_i = x_i \left(X^{T} X\right)^{-1} x_i^{T}$$

where $x_i$ is the descriptor row vector of training set compound $i$, $X$ is the n × k descriptor matrix of the training set compounds, $X^{T}$ is the transpose of the training set matrix $X$, and $x_i^{T}$ is the transpose of the row vector used to build the model. The warning leverage $h^*$ is the limit value used to check for influential molecules. It is computed from the number of descriptors in the built model and the number of compounds that make up the training set.
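A minimal sketch of the leverage calculation described above, assuming NumPy. The function names, the toy matrix, and the `n_parameters` argument are illustrative and not taken from the study. Whether the parameter count includes the regression intercept is a convention choice that the text does not settle: for a model with five descriptors and 25 training compounds, 3 × 5/25 gives 0.60 and 3 × 6/25 gives 0.72.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i (X^T X)^-1 x_i^T for each row x_i of X_query."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pseudo-inverse keeps the sketch usable if X^T X is near-singular.
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    # Row-wise quadratic form x_i M x_i^T.
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def warning_leverage(n_parameters, n_train):
    """h* = 3 * n_parameters / n_train (parameter count is a convention)."""
    return 3.0 * n_parameters / n_train

# Toy training matrix: four compounds, two descriptors (hypothetical values).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
h = leverages(X_toy, X_toy)
h_star = warning_leverage(n_parameters=5, n_train=25)
print(h, h_star)
```

A compound whose leverage exceeds the warning value is treated as influential. The leverages of the training compounds sum to the number of columns of the matrix, which gives a simple sanity check on the calculation.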
Assessment of Y-Randomization. The Y-randomization test is a confirmatory test to show that the developed QSAR model is reliable and robust and was not obtained by chance. This test was performed on the training set data as described in [11]. Multilinear regression (MLR) models were generated by randomly shuffling the dependent variable (activity data) while keeping the independent variables (descriptors) unaltered. The randomized models are expected to have significantly low $R^2$ and $Q^2$ values over a number of trials, which confirms that the developed QSAR model is robust. The Y-randomization coefficient ($cR_p^2$) is another important parameter, which should be greater than 0.5 to pass this test.
Advances in Preventive Medicine
Note. Superscript "a" represents the test set.

$$cR_p^2 = R \times \sqrt{R^2 - R_r^2}$$

Here $cR_p^2$ is the Y-randomization coefficient, $R$ is the correlation coefficient for Y-randomization, and $R_r$ is the average $R$ of the random models.
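The shuffling procedure and the coefficient defined above can be sketched as follows, assuming NumPy. The function names, the number of trials, the random seed, and the synthetic data are all illustrative. The sketch uses the commonly cited form cR_p² = R · sqrt(R² − R_r²) and takes R_r² as the mean R² of the shuffled models, which is one possible reading of "average R of random models".

```python
import numpy as np

def r2_of_fit(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=50, seed=0):
    """Return (R^2, mean R^2 of shuffled models, cR_p^2)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = r2_of_fit(X, y)
    # Refit after shuffling the activities; descriptors stay unaltered.
    r2_random = [r2_of_fit(X, rng.permutation(y)) for _ in range(n_trials)]
    rr2 = float(np.mean(r2_random))
    # Clamp at zero so the square root is defined if rr2 exceeds r2.
    c_rp2 = float(np.sqrt(r2) * np.sqrt(max(r2 - rr2, 0.0)))
    return float(r2), rr2, c_rp2

# Synthetic training set: 25 compounds, 2 descriptors (not from the paper).
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(25, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.7 * X_demo[:, 1] + 0.05 * rng_demo.normal(size=25)
r2, rr2, c_rp2 = y_randomization(X_demo, y_demo)
print(r2, rr2, c_rp2)
```

When the original model captures a real structure-activity relationship, the shuffled models fit far worse than the original one and the coefficient stays well above the 0.5 threshold.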
External Validation of the Model. The external validation of the developed QSAR model was further tested against the Golbraikh and Tropsha criteria [11,12], in which $r^2$ is the squared correlation coefficient of the plot of observed activity against calculated activity, $r_0^2$ is the squared correlation coefficient of the plot of observed activity against calculated activity at zero intercept, $r_0'^2$ is the squared correlation coefficient of the plot of calculated activity against observed activity at zero intercept, $k$ is the slope of the plot of observed activity against calculated activity at zero intercept, and $k'$ is the slope of the plot of calculated against observed activity at zero intercept.
Affirmation of the Built Model. The fitting ability, stability, reliability, predictiveness, and robustness of the developed models were evaluated by internal and external validation parameters. The validation parameters were compared with the accepted threshold values for any QSAR model [10][11][12][13] shown in Table 6.
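The quantities defined above can be computed with a short pure-Python sketch. The thresholds in `passes` follow a commonly cited form of the Golbraikh and Tropsha criteria and are an assumption here, since the exact list used in the study is not reproduced in the text. The zero-intercept definitions below are one common convention, and the function name and example activities are hypothetical.

```python
def golbraikh_tropsha(observed, calculated):
    """Return r^2, r0^2, r0'^2, k, k' and an overall pass flag."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_c = sum(calculated) / n
    s_oc = sum((o - mean_o) * (c - mean_c) for o, c in zip(observed, calculated))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_cc = sum((c - mean_c) ** 2 for c in calculated)
    r2 = s_oc ** 2 / (s_oo * s_cc)
    # Slopes of the regression lines forced through the origin.
    cross = sum(o * c for o, c in zip(observed, calculated))
    k = cross / sum(c * c for c in calculated)       # observed vs calculated
    k_prime = cross / sum(o * o for o in observed)   # calculated vs observed
    r0_2 = 1.0 - sum((o - k * c) ** 2 for o, c in zip(observed, calculated)) / s_oo
    r0p_2 = 1.0 - sum((c - k_prime * o) ** 2 for o, c in zip(observed, calculated)) / s_cc
    # Commonly cited thresholds (assumed, not quoted from the source).
    passes = (
        r2 > 0.6
        and abs(r0_2 - r0p_2) < 0.3
        and (
            ((r2 - r0_2) / r2 < 0.1 and 0.85 <= k <= 1.15)
            or ((r2 - r0p_2) / r2 < 0.1 and 0.85 <= k_prime <= 1.15)
        )
    )
    return {"r2": r2, "r0_2": r0_2, "r0p_2": r0p_2,
            "k": k, "k_prime": k_prime, "passes": passes}

# Hypothetical test-set activities (not from Table 1).
obs = [5.0, 5.5, 6.0, 6.5, 7.0]
calc = [5.1, 5.4, 6.1, 6.4, 7.1]
result = golbraikh_tropsha(obs, calc)
print(result)
```

Checking both regression directions guards against a model that correlates well with the observed activities but is systematically offset or scaled.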
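One of the internal validation parameters, the leave-one-out cross-validation coefficient, can be sketched as follows, assuming NumPy. The function name and the synthetic data are illustrative. The denominator here uses the mean activity of the full training set, which is a common convention but an assumption, as the text does not state the one used by the modeling software.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out Q^2 = 1 - PRESS / total sum of squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Refit the MLR model without compound i.
        A = np.column_stack([X[keep], np.ones(n - 1)])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        # Predict the left-out compound and accumulate its squared error.
        pred = np.append(X[i], 1.0) @ coef
        press += (y[i] - pred) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

# Synthetic training set: 20 compounds, 2 descriptors (not from the paper).
rng_demo = np.random.default_rng(2)
X_demo = rng_demo.normal(size=(20, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng_demo.normal(size=20)
q2 = loo_q2(X_demo, y_demo)
print(q2)
```

Because each compound is predicted by a model that never saw it, Q² is normally somewhat lower than the fitted R², and a large gap between the two suggests overfitting.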

Results and Discussion
A theoretical approach was employed to derive a QSAR model for predicting the activities of 2,4-disubstituted quinoline analogues against Mycobacterium tuberculosis. The Kennard-Stone algorithm divided the 36 studied compounds into a training set of 25 compounds and a test set of 11 compounds. The model was built on the training set, while its validity was assessed with the test set. The descriptors that best predict the activities of the inhibitory compounds were selected by Genetic Function Approximation (GFA), while multilinear regression (MLR) was used as the modeling technique to generate the QSAR model. GFA-MLR led to the selection of five (5) descriptors and four (4) QSAR models.
The observed activities, the calculated activities of the inhibitors, the residual values, and the leverage value for each compound are reported in Table 1. The low residual values between observed and calculated activities indicate that the generated model has a high predictive ability. The calculated descriptors for the training set and test set used to generate model 1 are reported in Table 2 for the purpose of reproducibility.
The names and symbols of the descriptors selected by the GFA approach are presented in Table 3. The combination of selected descriptors (2D and 3D) in model 1 indicates that these types of descriptors are able to characterize the antitubercular molecules and give better information on their structure.
The statistics and correlation matrix of the descriptors selected in model 1 are presented in Table 4. The Variance Inflation Factor (VIF) of each descriptor was calculated in order to check for orthogonality. The VIF values for all descriptors shown in Table 4 were less than 4, which indicates that the descriptors are not strongly intercorrelated and can be treated as approximately orthogonal.
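A minimal sketch of the VIF calculation, assuming NumPy, regresses each descriptor on the remaining descriptors and applies VIF = 1/(1 − R²). The function name and the toy matrix are illustrative and do not reproduce the values in Table 4.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest.

    A perfectly collinear column gives R_j^2 = 1 and a division by zero.
    """
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2_j)))
    return vifs

# Toy matrix with two mutually orthogonal descriptors (hypothetical values).
X_toy = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
vifs = variance_inflation_factors(X_toy)
print(vifs)
```

A VIF of 1 means a descriptor is uncorrelated with the others, and larger values indicate increasing multicollinearity, which makes the individual regression coefficients less reliable.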
The mean effect (ME) and standard regression coefficient values reported in Table 4 give vital information on the effect of each descriptor and its degree of contribution to the developed model. The signs and magnitudes of the mean effect values indicate the direction in which each descriptor influences the activity of a compound and its individual strength. Table 4 also gives the P-values of each descriptor in the model at the 95% confidence level. Because these P-values are below 0.05, the null hypothesis that there is no association between the descriptors and the activities of the molecules is rejected, and the alternative hypothesis that there is a relationship between the descriptors used in generating the model and the activities of the compounds is accepted.
The Pearson correlation coefficients calculated for the descriptors in the model are reported in Table 5. The low correlation coefficients between the descriptors imply that there is no significant intercorrelation among them.
The external and internal validation parameters used to confirm that the developed models are stable and robust are reported in Table 6. These parameters were in agreement with the threshold values reported in Table 6, which confirms the robustness and stability of the model. Based on these validation parameters, model 1 was selected as the optimum model and used to predict the activities of the 2,4-disubstituted quinoline derivatives.
The QSAR model generated in this research was compared with the models obtained in the literature [4,5], including the external validation ($R^2_{pred}$) reported for the test set in [4]. The validation parameters reported in this work and those reported in the literature were all in agreement with the threshold values presented in Table 6, which confirms the robustness of the generated model.
The Y-randomization test was also conducted. The Y-randomization coefficient ($cR_p^2$) has a value of 0.7443, greater than 0.5, as reported in Table 7, which supports the claim that the generated model is robust and was not obtained by chance.
The graphs of calculated activities plotted against observed activities of the training and test sets are presented in Figures 1 and 2. The correlation coefficient ($R^2$) of 0.9265 for the training set and the $R^2$ of 0.8034 for the test set recorded in this work are in line with the accepted QSAR threshold values reported in Table 6. This affirms the stability, reliability, and predictive power of the built model. The plot of residual activity against observed activity shown in Figure 3 indicates that there is no systematic error in the derived QSAR model, as the residual values fall within the accepted limit of ±2 on the residual activity axis.
The standardized residual activities plotted against the leverage values, known as the Williams plot, are shown in Figure 4. The plot clearly shows that all the compounds fall within the ±3 boundary of standardized cross-validated residuals. Hence, it can be inferred that no outlier is present in the data set. However, compound number 30 has a leverage value greater than the calculated warning leverage ($h^*$ = 0.60). Therefore, this compound is an influential molecule.
D-Optimal Design. D-optimal design was carried out in order to determine the optimal design location and maximize the efficiency of estimating the specified model. This was achieved using Statgraphics 18 software.
From the results presented in Table 8, the R-squared statistic indicates that the model as fitted explains 80.9278% of the variability in the observed activities. The correlation coefficient equals 0.899599, indicating a moderately strong relationship between the activity and the descriptors. The standard error of the estimate shows the standard deviation of the residuals to be 0.345508. This value can be used to construct prediction limits for new observations. The mean absolute error (MAE) of 0.25514 is the average absolute value of the residuals. The Durbin-Watson (DW) statistic tests the residuals to determine whether there is any significant correlation based on the order in which they occur in the data file. Since the P-value is greater than 0.05, there is no indication of serial autocorrelation in the residuals at the 95.0% confidence level.
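The residual statistics discussed above can be illustrated with a short pure-Python sketch. The function names and the residual values are hypothetical and are not taken from Table 8. The last line only checks that the reported correlation coefficient is the square root of the reported R-squared value.

```python
import math

def mean_absolute_error(residuals):
    """Average of the absolute residuals."""
    return sum(abs(e) for e in residuals) / len(residuals)

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest
    no first-order serial correlation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Hypothetical residuals that alternate in sign (not from the paper).
residuals_demo = [0.5, -0.5, 0.5, -0.5]
mae = mean_absolute_error(residuals_demo)
dw = durbin_watson(residuals_demo)

# Consistency check on the reported figures: R = sqrt(R^2).
r_from_r2 = math.sqrt(0.809278)
print(mae, dw, round(r_from_r2, 6))
```

The statistic ranges from 0 to 4. The alternating residuals in this toy example give a value well above 2, the pattern associated with negative serial correlation, whereas uncorrelated residuals give a value close to 2.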
The observed versus predicted plot presented in Figure 5 shows the observed values of Y on the vertical axis and the predicted values of Y on the horizontal axis. The points are randomly scattered around the diagonal line, which indicates that the model fits well. The Prediction Variance Plot presented in Figure 6 shows how the standard error of the predicted response varies across the design region. The standard error displayed is the square root of the unscaled prediction variance. A surface plot is created for the first two design factors, AATS5e and RDF110s, with all other factors held constant. For an optimal design, the standard error must be lowest near the center of the design region and increase as the location moves away from the center in any direction. The Prediction Profile graph presented in Figure 7 displays the standard error of the predicted response as a function of each design factor as the factors are moved from a specified reference point. The reference location of each factor in the design region was AATS5e = 3.34, RDF110s = 4.52, SpMin7_Bhe = 65.38, TDB9e = 18.89, and VR1_Dzs = 0.38. At this location, the standard error of prediction equals 0.345508. The plot shows the location of each factor in standardized units, in which the specified low value equals -0.4, the center is 0, and the specified high value equals 0.4. The lines on the plot show how the standard error changes as the factors are moved away from the reference location. The standard errors remain small within the low-to-high range (-0.4 to 0.4) but start to increase rapidly outside that range.

Conclusion
A theoretical approach was employed in this study on selected molecular descriptors to derive a model that correlates the structures of 2,4-disubstituted quinoline derivatives, as potent inhibitors of Mycobacterium tuberculosis, with their respective biological activities. The derived model was subjected to internal and external validation tests to confirm that the built QSAR model is significant, robust, and reliable. From the results, it is concluded that 2,4-disubstituted quinoline derivatives can be modeled using the molecular descriptors AATS5e, VR1_Dzs, SpMin7_Bhe, TDB9e, and