Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria

Efficient evaluation of the biotoxicity of organic compounds is of vital significance to resource utilization and environmental protection. In this study, the toxicity of 110 alcohol compounds to tadpoles of Rana temporaria is adopted as the dependent variable, and 1388 physiochemical parameters (features) calculated by PCLIENT are used to represent each compound. A three-step feature selection pipeline is developed to refine the feature subset: 282 features significantly correlated with the biotoxicity of the compounds are preliminarily selected via the maximal information coefficient (MIC); 138 descriptors with a positive contribution to model performance are reserved after a support vector regression (SVR) based backward elimination; and 18 descriptors are finally selected via a forward selection process integrating minimal redundancy maximal relevance (mRMR), MIC and SVR. For feature subsets with different numbers of variables, quantitative structure-activity relationship (QSAR) models are built using multiple linear regression (MLR), partial least squares regression (PLS) and SVR, respectively. The independent prediction evaluation index, Q2, increases from −74.787, 0.824 and 0.868 to 0.892, 0.878 and 0.940 for the three regression models, respectively. The results suggest that nonlinear feature selection methods based on MIC and SVR can effectively eliminate irrelevant descriptors, and that SVR outperforms classical statistical models for QSAR modeling on high-dimensional data containing nonlinear relationships between features. The methods proposed in this study have potential applications in QSAR research fields such as the prediction of compound biotoxicity.


Introduction
So far, humans have discovered more than 80 million kinds of organic compounds, many of which are entering or have entered the ecological environment in various ways. Of special note is that a majority of these organics have biotoxicity.
It is an indispensable step to evaluate the intensity of the biotoxicity of compounds before releasing them into the environment.1 Acute toxicity to aquatic animals has been much investigated. Amphibians, such as frogs and tadpoles, are often adopted as biological materials to evaluate acute toxicity: they have highly permeable skin, which makes it easier for them to absorb surrounding materials and makes them more sensitive to polluted water.2,3 Evaluating the toxicity of organics through experimental methods is time-consuming and cost-ineffective, especially when thousands of organics must be assessed. In addition, experimental determination of toxicity is only applicable to synthesized compounds; it cannot evaluate the toxicity of compounds that have not yet been synthesized.
Quantitative structure-activity relationship (QSAR) modeling investigates the relationship between the bioactivity and the molecular structural parameters of compounds using chemometric methodology, so QSAR is recognized as a bridge between chemistry and biology.4 High prediction accuracy is the key problem for QSAR modeling of the toxicity of organics. A model's accuracy mainly depends on the calculation of descriptors, variable selection and the choice of regression model. Descriptors should be effective and easily acquired; i.e., numerical descriptors can be obtained directly by quantum chemistry methods, even for virtual compounds.5 Elements of the descriptor subset should be statistically significant and have good interpretability. The model constructed needs to guarantee robustness and generalization ability. The support vector machine (SVM) is a strong performer in the machine learning field, built on statistical learning theory and the principle of structural risk minimization, and has been widely applied in QSAR studies.12,13 The maximal information coefficient (MIC) proposed by Reshef et al.14 in 2011 is a new measure of correlation that captures nonlinear relationships between two continuous random variables. Different mutual information values are obtained by splitting the two continuous variables into intervals of unequal window size using a dynamic programming algorithm; the maximum among them is then normalized by a logarithmic operation. These operations endow MIC with both generality, meaning applicability to various nonlinear function types, and equitability, meaning that similar MIC scores are obtained for different function types with equal noise.

Maximal Information Coefficient and Support Vector Regression
This study uses the molecular descriptor calculation software PCLIENT to calculate thousands of physiochemical parameters for every small-molecule alcohol compound.15 The optimum descriptor subset is obtained by a feature selection pipeline containing a three-step search strategy: (i) select statistically significant features that show nonlinear correlation with the biotoxicity of the compounds using a MIC-based univariate filter; (ii) refine the feature subset by support vector regression based backward elimination (SVR-BE);16 (iii) obtain the optimal subset via a forward selection process integrating minimal redundancy maximal relevance (mRMR), MIC and SVR. A QSAR model is finally built on the training set with the reserved descriptors and then used to predict the toxicities of the test-set compounds to Rana temporaria. The results suggest that the model has potential prospects in the QSAR research field of toxic compounds.

Data set
The data studied in this paper are extracted from literature2 and comprise 110 alcohol organic small-molecule compounds, obtained after removing 13 compounds with similar molecular structures or toxicities. The toxicity index of the alcohol compounds is the negative logarithm of the 50% growth inhibition concentration (pIGC50, measured in mmol L-1) to Rana temporaria. 30 samples are randomly chosen to form the test set, whose pIGC50 values range from 0.19 to 5.25 (see Table 1). The remaining 80 samples form the training set, whose pIGC50 values range from 0.24 to 5.30. The model built on the training set is used to predict the toxicities of the samples in the test set.

Acquisition of molecule descriptors
First, the molecular structure editor JME Editor17 is used to draw each molecular structure, and the drawn structure is saved in the simplified molecular input line entry system (SMILES) format. Then the SMILES string is used as the input of PCLIENT,18 by which the descriptors for every molecular structure are calculated.15

Screening of descriptors

MIC based univariate filter
By applying a dynamic splitting of the scatter diagram of two continuous variables, a MIC corresponding to the optimal splitting pattern can be obtained. The MIC is defined as equation 1:

MIC(x,y) = max{I(x,y) / log2 min{nx, ny}}  (1)

where I(x,y) represents the mutual information between x and y; nx and ny denote the numbers of bins into which x and y are partitioned, subject to nx × ny < B(n), with B(n) = n^0.6; n is the number of samples and 0.6 is the empirical value suggested by Reshef et al.14 The MICs are calculated sequentially between the experimental toxicity value and each of the physiochemical properties. Since the nonlinear relationship between a descriptor and the dependent variable does not follow a specific distribution, it is hard to assess the significance of a MIC using classical hypothesis testing. However, the uncorrected p-value of a given MIC score under a null hypothesis of statistical independence depends only on the score and on the sample size. It can therefore be computed by selecting a probability α of false rejection, creating a set of 1/α − 1 surrogate bivariate samples by randomly permuting Y with respect to the independent variable, and then comparing the MIC of the real bivariate sample with the MIC scores of the surrogates for the given sample size.19 Descriptors whose Bonferroni-corrected p-value is larger than 0.05 are filtered out. The R package Minerva20 is employed to compute the MIC.
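As an illustration, the grid search behind equation 1 can be sketched in Python. This is a deliberate simplification that only scores equal-frequency grids, whereas the true MIC statistic also optimizes the placement of grid lines via dynamic programming; the function names are hypothetical, and the Minerva package used in the study should be preferred for real calculations.

```python
import numpy as np

def normalized_mi(x, y, nx, ny):
    """Mutual information on an equal-frequency nx-by-ny grid, normalized
    by log2 min(nx, ny) as in equation 1."""
    x_edges = np.quantile(x, np.linspace(0, 1, nx + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                       # skip empty cells in the MI sum
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    return mi / np.log2(min(nx, ny))

def approx_mic(x, y, alpha=0.6):
    """Take the best score over all grids with nx * ny <= B(n) = n**alpha."""
    n = len(x)
    b = max(int(n ** alpha), 4)
    best = 0.0
    for nx in range(2, b + 1):
        for ny in range(2, b + 1):
            if nx * ny > b:
                continue
            best = max(best, normalized_mi(x, y, nx, ny))
    return best
```

On a noiseless monotone relationship this approximation already reaches a score near 1, while for independent variables it stays low, which is the property the filtering step relies on.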

SVR based backward elimination 16
Suppose the variable subset is a data matrix (yi, xij), i = 1, 2, …, n, j = 1, 2, …, m, where n is the sample size and m is the number of reserved variables. Firstly, an initial mean square error, MSE0, is obtained by SVR-based cross-validation on the training set. Then the MSE vector (MSE1, …, MSEj, …, MSEm) is obtained by removing the jth descriptor one at a time. The descriptor corresponding to the minimum of the MSE vector is deleted. If all elements of the MSE vector are greater than MSE0, the backward elimination stops; otherwise, the elimination continues in the same way.
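A minimal sketch of this backward-elimination loop, assuming scikit-learn's SVR as the regressor (the paper works with LIBSVM directly); the function name is hypothetical:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def svr_backward_elimination(X, y, cv=5):
    """Drop, one per round, the descriptor whose removal most lowers CV MSE."""
    keep = list(range(X.shape[1]))

    def cv_mse(cols):
        scores = cross_val_score(SVR(kernel="rbf"), X[:, cols], y,
                                 scoring="neg_mean_squared_error", cv=cv)
        return -scores.mean()

    base = cv_mse(keep)              # MSE0 on the full subset
    while len(keep) > 1:
        trial = [cv_mse([c for c in keep if c != j]) for j in keep]
        best = int(np.argmin(trial))
        if trial[best] >= base:      # every removal hurts: stop eliminating
            break
        base = trial[best]
        del keep[best]
    return keep
```

The returned list contains the column indices of the reserved descriptors.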

Forward selection
Minimal redundancy maximal relevance (mRMR)21 is a well-known feature selection algorithm in pattern classification studies on high-dimensional data. Several of its disadvantages have been summarized in a previous study:22 its limitation to classification rather than regression problems; the incomparable correlation measures between Xi versus Y and Xi versus Xj (where Xi is one of the independent variables and Y is the dependent variable); its sensitivity to non-normally distributed data; and its inability to reflect nonlinear redundancy among variables. Deng et al.22 employed a nonlinear correlation measure named distance correlation (dCor)23 to improve the mRMR algorithm and obtained outstanding prediction accuracies in QSAR modeling of several datasets. Although their method (mRMR-dCor) overcame some of the disadvantages of the initial mRMR, dCor is not equitable even in the basic case of functional relationships.14 In addition, mRMR-dCor is very time-consuming on high-dimensional data since it applies the forward selection strategy to the original feature set. Considering the advantages of MIC and the need for time-efficient feature selection, we embed MIC rather than dCor into mRMR and use the features reserved in the previous steps as input variables. The mRMR-MIC criterion is defined as follows (equation 2):

max over Xi ∈ ΩS of [MIC(Xi, Y) − (1/|S|) Σ Xj∈S MIC(Xi, Xj)]  (2)

where S represents the feature subset that has already been introduced, ΩS represents the feature subset that has not been introduced, and Xj ∈ S. A significance order of the features is obtained using equation 2, and the forward selection process then removes redundant features by SVR-based 5-fold cross-validation during each iteration.24
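The greedy ordering of equation 2 can be sketched as follows, with the MIC scores assumed to be precomputed and passed in; the function and argument names are hypothetical:

```python
import numpy as np

def mrmr_rank(relevance, redundancy, n_select):
    """Greedy mRMR-MIC-style ordering from precomputed scores.

    relevance[i]     -- MIC(X_i, Y)
    redundancy[i][j] -- MIC(X_i, X_j)
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < n_select:
        def score(i):
            if not selected:                   # first pick: relevance only
                return relevance[i]
            mean_red = np.mean([redundancy[i][j] for j in selected])
            return relevance[i] - mean_red     # equation 2 criterion
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, if features 0 and 1 are both highly relevant but mutually redundant, while feature 2 is weakly relevant but independent, the second pick is feature 2 rather than feature 1.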

Regression models

Multiple linear regression (MLR)
Multiple linear regression (MLR) is the most classical and commonly used regression model in statistics. Its working principle is simple and the resulting model is easy to comprehend; thus, it has found wide application in QSAR research. The MLR equation can be written as equation 3:

ŷ = b0 + b1x1 + b2x2 + … + bmxm  (3)

where ŷ is the dependent variable; x1 to xm are the independent variables; b0 is the constant term; and b1 to bm are the partial regression coefficients. The MLR model in this research is fitted with the "regress.m" program in the MATLAB statistical toolbox.25
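For readers without MATLAB, equation 3 can be fitted by ordinary least squares in a few lines of Python (a stand-in for regress.m; the function names are hypothetical):

```python
import numpy as np

def mlr_fit(X, y):
    """Return [b0, b1, ..., bm] for equation 3 by ordinary least squares."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend the constant term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mlr_predict(coef, X):
    """Apply equation 3: yhat = b0 + b1*x1 + ... + bm*xm."""
    return coef[0] + np.asarray(X) @ coef[1:]
```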

Partial least square (PLS) regression
The partial least squares (PLS) regression is a multivariate statistical method for predictive modeling based on the correlation between latent variables. Combining major advantages of principal component analysis (PCA), correlation analysis and MLR, it can capture the intrinsic structural information of the data set and depict the correlation between the independent variables and the dependent variable more effectively, which improves modeling performance.
In this paper, the PLS regression model is fitted with the "plsregress.m" program in the MATLAB statistical toolbox. The optimal number of latent variables in a PLS model corresponds to the minimum MSE obtained over 100 replicates of a 10-fold cross-validation.

Support vector regression (SVR)
The support vector machine is a method proposed on the basis of statistical learning theory and is popular in pattern recognition and machine learning.26 SVR is efficient in handling nonlinearity, overfitting and locally optimal solutions, especially for data sets with small sample sizes and high dimensionality. The core of SVR is the equation of a hyperplane (equation 4):

f(x) = w·φ(x) + b  (4)

where φ is the mapping induced by the kernel function, w is the weight vector and b is the bias. The kernel function maps the variables to a high-dimensional space so that the samples become separable by the hyperplane, while the margin between the samples and the hyperplane is maximized. The vectors closest to the hyperplane are called support vectors. As mentioned above, SVM comprises SVC and SVR: the former is for classification, while the latter is for regression. In this research, SVR is adopted.
In this paper, the SVR model is built with the LIBSVM package developed by Chang et al.5 The radial basis function is set as the kernel. The parameters to be optimized include the penalty parameter, c; the radial kernel parameter, g; and the loss-function parameter, p. Parameter optimization is performed through a grid search.
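Such a grid search can be sketched with scikit-learn, whose SVR exposes LIBSVM's c, g and p as C, gamma and epsilon. The search ranges below are illustrative assumptions; the paper does not report its actual grid:

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Illustrative search ranges (assumed; the paper does not list its grid).
param_grid = {
    "C": [2 ** k for k in range(-2, 9, 2)],      # penalty parameter c
    "gamma": [2 ** k for k in range(-8, 1, 2)],  # RBF kernel parameter g
    "epsilon": [0.01, 0.1, 0.5],                 # loss-function parameter p
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
# After search.fit(X_train, y_train), the tuned model is
# search.best_estimator_ and the chosen values are search.best_params_.
```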

Model evaluation indexes
The independent prediction precision of the model is evaluated by the root mean square error of prediction (RMSEP) and the Q2 proposed by Tropsha et al.27 (equations 5 and 6):

RMSEP = sqrt[(1/nte) Σ i=1..nte (yi − ŷi)^2]  (5)

Vol. 30, No. 2, 2019

Q2 = 1 − [Σ i=1..nte (yi − ŷi)^2] / [Σ i=1..nte (yi − ȳtr)^2]  (6)

where yi is the observed value of the dependent variable in the test set; ŷi is the predicted value; nte is the number of samples in the test set; and ȳtr is the mean value of the dependent variable in the training set.
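Equations 5 and 6 translate directly into a few lines of Python (the function names are hypothetical):

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, equation 5."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def q2(y_test, y_pred, y_train_mean):
    """External Q2 of equation 6, referenced to the training-set mean."""
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    press = np.sum((y_test - y_pred) ** 2)   # prediction error sum of squares
    tss = np.sum((y_test - y_train_mean) ** 2)
    return float(1.0 - press / tss)
```

Note that Q2 uses the training-set mean in the denominator, so a model no better than predicting that mean scores near zero, and a worse model scores negative (as the MLR model does on the full descriptor set).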

Calculation, preprocessing and screening of descriptors
The structural formulas of all compounds are entered into the online service PCLIENT, and 1917 descriptors are obtained. Descriptors containing 999 or −999 are eliminated, as are variables with zero variance, leaving 1388 effective descriptors for the subsequent QSAR modeling. MIC-based univariate filtering is then conducted between the experimental toxicity value and each of the physiochemical properties; 282 descriptors with a corrected p-value smaller than 0.05 are found. SVR-BE is further adopted to refine the variable subset, reserving 138 descriptors, and mRMR-MIC is employed to screen out the final subset, with 18 descriptors reserved in the end.

Model comparison
First, the sets of 1388, 282, 138 and 18 descriptors are adopted as independent variables, respectively, to evaluate the effectiveness of the descriptor screening methods. Then MLR, PLS regression and SVR models are built on each descriptor set. The results are shown in Table 2. As the number of descriptors is reduced from 1388 to 18, the independent prediction errors (RMSEP) of the MLR, PLS and SVR models improve from 9.858, 0.475 and 0.411 to 0.372, 0.429 and 0.277, respectively, suggesting that the variable screening method is valid for all three models. The prediction error of MLR decreases markedly once the number of variables is reduced from 1388 to 138. Since the PLS regression model is built on latent variables, the high-dimensional space is further compressed and the nonlinear information of the variables is partially utilized. Horizontal comparison of the models shows that the fitting accuracy of MLR on the training set is high but its performance on the test set is poor, indicating extreme overfitting of the MLR model. The fitting precision of the PLS regression on the training set is also very good; in particular, its prediction precision on the independent test set is obviously better than that of the MLR model, but lower than that of SVR. Although the fitting precision of SVR on the training set is lower than that of the MLR model across the different feature sets, the prediction precision of the SVR model outperforms the other two models on the independent test sets, indicating that SVR can effectively overcome overfitting.

Descriptor significance and model robustness
There are more than 3000 descriptors in the PCLIENT web database, falling into 24 groups; different descriptors within the same group have similar physicochemical properties. The 18 descriptors screened out are mapped to the molecular physiochemical property groups provided by PCLIENT, as shown in Table 3; they are distributed over 11 groups. The model built on the 18 descriptors has high precision both in fitting and in independent prediction. Figure 1 compares the predicted values of the SVR model on the independent test set with the experimental values. Xternal Validation Plus42,43 is a tool that computes all the required external validation parameters and judges the prediction quality of a QSAR model based on MAE-based criteria. Taking the 30 independent test samples as input, the MAE-based metrics estimate that the model shows 'GOOD' predictions (after removing the 5% of test set objects with the highest residual values).

Conclusions
PCLIENT is used to represent the alcohol organic small-molecule compounds. Each compound is described by 1388 physiochemical parameter descriptors covering multiple properties, such as hydrophobicity, topography, electrophilicity and three-dimensional properties, so that the alcohol compounds are comprehensively and systematically represented. However, for QSAR modeling, irrelevant and redundant descriptors can degrade the prediction precision. Thus, this paper first calculates the MIC of all descriptors; then, based on the Bonferroni-corrected p-value, 282 descriptors with significance below 0.05 are screened out. After multiple rounds of backward elimination, 138 critical physiochemical descriptors are retained, and the mRMR-MIC method is further employed to filter the variable set, with 18 descriptors screened out in the end. Based on these 18 molecular descriptors, the SVR algorithm is employed to build the QSAR model, which is then applied to predict the biotoxicity of the test samples with favorable results: the prediction indexes Q2 and RMSEP reach 0.940 and 0.277, respectively. Compared with the classical statistical models, MLR and PLS regression, the model proposed in this paper is significantly superior and has promising prospects for further application to QSAR research on the toxicity of alcohol organic small-molecule compounds.

Figure 1. Experimental and predicted values of the independent test set based on the SVR (support vector regression) model. The 18 descriptors are taken as input features and the 30 samples (see Table 1) as the independent test set; the pIGC50 values are then predicted by SVR. The closer a circle mark is to the diagonal line, the better the prediction performance.
Lifeng Wang,a,b Pengwei Xing,c Cong Wang,a,b Xiaomao Zhou,a,b Zhijun Dai*,c and Lianyang Bai*,a,b

Table 1.
Toxicity experimental values (pIGC50) of the test set (based on the 18 reserved descriptors) and the values predicted by different models. Exp: experimental; MLR: multiple linear regression; PLS: partial least squares regression; SVR: support vector regression.

Table 2.
Comparison of the training set fitting precision and independent prediction precision of different models.

Most sample points fall near the 45° line, and the error between the experimental and predicted values is small, meaning that the 18 descriptors screened out are highly significant and the SVR model is robust. Enalos Nodes are cheminformatics tools developed by NovaMechanics Ltd.39,40 They are distributed as a package for the KNIME software, so Enalos Nodes can be used after the package is installed on KNIME.41 EnalosNodes contains four nodes; we use the Applicability Domain (APD) node based on Euclidean distances to verify the domain of applicability. Taking the training set and test set data (with Y excluded) as input, the results show that only one of the 30 test samples is unreliable.

Table 3.
Physicochemical property groups of the 18 reserved descriptors