Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Hansen Solubility Parameters Based on 1D and 2D Molecular Descriptors Computed from SMILES String

A new method of Hansen solubility parameters (HSPs) prediction was developed by combining the multivariate adaptive regression splines (MARSplines) methodology with a simple multivariable regression involving 1D and 2D PaDEL molecular descriptors. In order to adopt the MARSplines approach to QSPR/QSAR problems, several optimization procedures were proposed and tested. The effectiveness of the obtained models was checked via standard QSPR/QSAR internal validation procedures provided by the QSARINS software and by predicting the solubility classification of polymers and drug-like solid solutes in collections of solvents. By utilizing information derived only from SMILES strings, the obtained models allow for computing all of the three Hansen solubility parameters including dispersion, polarization, and hydrogen bonding. Although several descriptors are required for proper parameters estimation, the proposed procedure is simple and straightforward and does not require a molecular geometry optimization. The obtained HSP values are highly correlated with experimental data, and their application for solving solubility problems leads to essentially the same quality as for the original parameters. Based on provided models, it is possible to characterize any solvent and liquid solute for which HSP data are unavailable.

Due to the high usability of HSP, many experimental and theoretical methods of determining these parameters were proposed. For example, HSP can be calculated utilizing the equation of state [38] derived from statistical thermodynamics. Alternatively, models taking advantage of the additivity concept, such as the group contribution method (GC) [25,[39][40][41] is probably the most popular one. Despite the simplicity and success of these approaches, there are some important limitations. First of all, the definition of groups is ambiguous which leads to different parameterization provided by different authors [39]. Besides, the same formal group type can have varying properties, depending on the neighborhood and intramolecular context. As an alternative, molecular dynamics simulations were used for HSP values determination [16,[42][43][44] even in such complex systems as polymers. Interestingly, quantum-chemical computations were rarely used for predicting HSP parameters. However, the method combining COSMO-RS sigma moments and artificial neural networks (ANN) methodology [45] deserves special attention. Noteworthy, much better results were obtained using ANN than using the linear combination of sigma moments [45]. e application of nonlinear models is a promising way of HSP modeling. In recent times, there has been a significant growth of interest in developing QSPR/QSAR models utilizing nonlinear methodologies, like support vector machine [46][47][48][49][50] and ANN [51][52][53][54][55] algorithms. e attractiveness of these methods lies in their universality and accuracy. However, many are characterized by complex architectures and nonanalytical solutions. An interesting exception is the multivariate adaptive regression splines (MARSplines) [56]. is method has been applied for solving several QSPR and QSAR problems including crystallinity [57], inhibitory activity [58,59], antitumor activity [60], antiplasmodial activity [61], retention indices [62], bioconcentration factors [63], or blood-brain barrier passage [64]. Interestingly, some studies suggested a higher accuracy of MARSplines when compared to ANN [57,58,65]. An interesting approach is the combination of MARSplines with other regression methods. As shown in the research on blood-brain barrier passage modeling, the combination of MARSplines and stepwise partial least squares (PLS) or multiple linear regression (MLR) gave better results than pure models [64]. e MARSplines model for a dependent (outcome) variable y and M + 1 terms (including intercept) can be summarized by the following equation: where summation is over M terms in the model, while F 0 and F m are the model parameters. e input variables of the model are the predictors x ](k,m) (the kth predictor of the mth product). e function H is defined as a product of basis functions (h): where x represents two-sided truncated functions of the predictors at point termed knots. is point splits distinct regions for which one of the formula is taken, (t − x) or (x − t); otherwise, the respective function is set to zero. e values of knots are determined from the modeled data.
Since nonparametric models are usually adaptive and with a high degree of flexibility, they can very often result in overfitting of the problem.
is can lead to poor performance of new observations, even in the case of excellent predictions of the training data. Such inherent lack of generalizations is also characteristic for the MARSplines approach. Hence, additionally to the pruning technique used for limiting the complexity of the obtained model by reducing the number of basis functions, it is also necessary to augment the analysis with the physical meaning of obtained solutions. e purpose of this study is to test the applicability of the MARSplines approach for determining Hansen solubility parameters and to verify the usefulness of the obtained models by solubility predictions. Hence, an in-depth exploration was performed, including resizing of the models combined with a normalization and orthogonalization of both factors and descriptors. Also, a comparison with the traditional multivariable regression QSPR approach was undertaken. Finally, the obtained models were used for solving typical tasks for which Hansen solubility parameters can be applied, in order to document their reliability and applicability.
Using information encoded in canonical SMILES, PaDEL software [69] offers 1444 descriptors of both 1D and 2D types. Not all of them can be used in modeling, and those descriptors which are not computable for all compounds or with zero variance were rejected from further analysis. e remaining 886 parameters were used for models definition.

Computational Protocol.
Model building was conducted using absolute values of descriptors or orthogonalized data. Since there are different criteria for selecting independent variables from the pool of mutually related ones, two specific criteria were applied. e first one relied on the direct correlation with modeled HSP data if R 2 > 0.01. e second one used ranking offered by Statistica [70], tailored for regression analysis. ese parameters were considered as nonorthogonal ones for which the Spearman correlation coefficient was higher than 0.7 (R 2 > 0.49). ese different methods of orthogonalization led to different sets of descriptors used during application of QSPR or MARSplines approaches. Types of performed computations are summarized on Scheme 1.

QSPR Approach.
e development of QSPR models and internal validation of the multiple linear regression (MLR) approach was conducted using QSARINS software 2.2.2 [71,72]. e genetic algorithm (GA) for variable selection was applied during the generation of the models, which were defined with no more than 20 variables. e following fitting quality parameters were used for the model evaluation: determination coefficient (R 2 ), adjusted determination coefficient (R adj ) 2 , Friedman's "lack of fit" (LOF) measure, global correlation among descriptors (K xx ) [73,74], rootmean-square error, and mean absolute error (RMSE tr and MAE tr ) calculated for the training set and F (Fisher ratio). Also, the following internal validation parameters were used: leave-one-out validation measure (Q loo ) 2 , cross-validation root-mean-square error, and mean absolute error (RMSE cv and MAE cv ).

Results and Discussion
Since the aim of this paper is the verification of the efficiency of predicting Hansen solubility parameters based on models derived using the MARSplines approach, two alternative procedures were adopted. e first one relies directly on the solution coming from application of the MARSplines procedure. e resulting factors were then used for assessment of p, d, and HB parameters. Alternatively, in the second step, the obtained factors were used as new types of descriptors and applied in the standard QSPR modeling along with the ones obtained from PaDEL. e premise of such attempt relied on the assumption that new factors, accounting for nonlinear contributions, combined with descriptors raise the accuracy of the model. e consistency of the models was checked using an internal validation procedure and additionally by applying them for solving some typical tasks that utilize Hansen solubility parameters. Particularly, the classification of polymers as soluble and nonsoluble ones in a set of solvents was compared with the original values of Hansen parameters. Similarly, the prediction of preferential solubility of some drugs was tested.
3.1. MARSplines Models. Several models were computed using the whole set of 886 available descriptors (run1 and run2). Typically, the size of the problem was restricted to 25 or 30 basis functions with the number of interactions increasing from 2 up to 10. For example, the simplest model restricted to 25 basis functions with no more than double interactions is denoted as (25,2). For each model, the regressions were analyzed in two manners. Firstly, the direct application of the set of factors obtained from MARSplines was performed for solving regression equations. Since some of the generated factors have shown an apparent linear correlation, the orthogonalization of the factors was undertaken according to the two mentioned approaches. is resulted in two alternative models, usually of lower complexity.

MARSplines Modeling of Parameter d. Hansen solubility
parameter d is the measure of interaction energy via dispersion forces. As other contributions to Hansen solubility space, it is expressed as the density of cohesive energy. Among all three descriptors, this one seems to be the most difficult to predict. Fortunately, the MARSplines procedure performed quite well even in this case. e details of all developed models are provided in Figure 1, which offers several interesting conclusions. First of all, the models with satisfactory descriptive potential are quite complex, requiring several factors. Fortunately, the actual number of descriptors is usually much lower since many factors utilize the same molecular descriptors. Besides, models relying on the absolute values of descriptors outperform models constructed using normalized descriptors.
is seems to be surprising since normalization should not lead to any change in the model quality; however, in the case of MARSplines, there is a significant gain in using absolute values.
is can be attributed to the very nature of MARSplines, which is strictly a data-driven nonparametric procedure. Another interesting conclusion comes from inspection of trends indicated by the solid black lines. e rise of the number of interactions does not seriously improve the quality of predictions. Although the d(30, 10) model is slightly better than d (25,2), it comes at a cost of additional three factors. is is a fortunate circumstance, suggesting that developing simpler models can be quite sufficient. In the case of the d(25, 2) model, the value of the adjusted correlation coefficients (R adj ) 2 is as high as 0.94. e formal mathematical formula of the MARSplines-derived model is analogical to a typical QSPR equation, although instead of descriptors, the MARSplines factors are present. In the case of the d(25, 2) model, Eq. (4) defines the mathematical formula for computation of the d parameter.

Journal of Chemistry
Factors definitions, along with their contributions, were summarized in Table 1: e values of coefficients come from the internal validation procedure performed using the QSARINS default algorithm. It is a typical many-leave-out procedure rejecting 30% of the data. e correlation between experimental and computed values of the d solubility parameter is plotted in Figure 2. Both data for d (25,2) and d(30, 1) models were provided. It is quite visible that the gain of the extended model is not very impressive, and for further applications, the d parameter will be computed according to model defined by Eq. (4). Although formally there are nineteen factors in this equation, some can actually be consolidated as one. For example, F1 appears in definitions of F3, F4, F17, and F18. It seems to be rational to consolidate them into one by extraction of F1 and redefining the factors by multiplication of the sum of the remaining parts by F1. is in fact does not change the size of the problem, which should be attributed to the number of descriptors used in definition of MARSplines factors rather than factors. In the case of Eq. (4), twelve PaDEL descriptors are used. e majority of them (ATSC1i, AATS2e, AATS2p, ATSC3p, AATSC6v, ATSC1v, ATS4m, and GATS6c) belongs to 2D autocorrelation descriptors [75]. One descriptor VE3_Dzi is of the Barysz matrix type [75]. Besides, atom-type electrotopological state 2D descriptors (SsOH and minHCsats) were also included in the model [76][77][78]. Finally, the values of the nHBDon_Lipinski descriptor are also used in the model, and this parameter represents simply the number of hydrogen bond donors.
As it was mentioned beforehand, the construction of the models using MARSplines factors can in some cases lead to apparent mutual linear correlation between these factors. In all observed cases, these dependencies were really superficial and resulted from the fact that the basis functions used knots for splitting values below and above the given threshold. In such situation, the correlation, even if mathematically detectable, has no significant meaning and is artificial. From the formal point of view, it is possible to rearrange such factors in the regression function, consolidating them into one and removing these apparent correlations. However, it was interesting to observe if it is possible to reduce the number of factors in the model by eliminating these apparently nonorthogonal ones. For this purpose, two types of orthogonalization were performed, and the results are presented in Figure 2. First of all, the models were significantly worse compared to the original ones.
is is not surprising, since after orthogonalization, fewer factors were used in the final regression function, which resulted not only from elimination of apparently related ones but also from the fact that correlation coefficients in new regressions were not statistically significant. Indeed, the reduction of the d (25,2)   Without normalization but with orthogonalization * * separately for each parameter run4 With both normalization and orthogonalization * * separately for each parameter * In this modeling, the whole set of available parameters was used (886 descriptors) for each parameter. * * Two rankings of descriptors were used. First one (A) was done according to direct correlation with modeled data that provided R 2 > 0.01. Application of first type of orthogonalization and exclusion of theparameters with R 2 < 0.01 reduced the number of descriptors down to127 in the case of the d parameter, 134 for p parameter and 128 the most appropriate for the HB parameter. e second one (B) used ranking offered by Statistica, selecting the most suitable parameters for regression analysis. Application of the first type of orthogonalization and excluding parameters with R 2 < 0.01 reduced the number of descriptors down to 118 in the case of d and HB parameters, and down to 124 for the p parameter.

MARSplines Modeling of Parameter p.
Series of models for computing the polarity descriptor was also developed, and their predictive powers are summarized in Figure 3. e quality of the correlation between experimental values and the ones predicted using the best models is illustrated in Figure 4.
As one can infer from Figure 3, the best model with orthogonal factors is p (30,10). However, it is characterized by a high degree of descriptors interaction. erefore, the most optimal one seems to be p (25,3). is model is expressed by Eq. (5), and the factors descriptions along with their contributions are summarized in Table 2. is model utilizes descriptors belonging to several classes, namely, information content (IC0 and ZMIC2) [75], autocorrelation (AATS2m, GATS1e, GATS2e, GATS5m, AATSC5i, ATSC5e, and MATS1v) [75], molecular linear-free energy relation (MLFER_S) [79], mindssC [76][77][78], and Petitjean topological and shape indices (PetitjeanNumber) [80]. e reduction of variables achieved using the genetic algorithm does not always guarantee that descriptors with clear meaning will be selected. Nevertheless, among descriptors    which appeared in the p(23, 3) model, IC0 and MLFER_S are quite simple to interpret in the context of polarity HSP since IC0 index expresses the diversity (heterogeneity) of atomic types [81], while MLFER_S is associated with the dipolarity/ polarizability features of molecules [57,82,83]. Also autocorrelation descriptors GATS1e, GATS2e, and MATS1v deserve for special attention. In general, autocorrelation indices do not have a clear interpretation. Nevertheless, their   appearance seems to be understandable since these descriptors were applied in different solubility prediction models reported previously [84][85][86]:

MARSplines Modeling of Parameter HB.
Analogously to the previously discussed parameters, the model corresponding to the hydrogen bonds interactions was developed and optimized. e results are summarized in Figures 5  and 6.
e SHBd descriptor is simply the sum of all E-States corresponding to hydrogen bonds donors [76][77][78]. ETA_dEpsilon_D parameter is also associated with hydrogen bonds donating abilities. us, both SHBd and ETA_dEpsilon_D descriptors have been used for QSAR protein binding/inhibition problems solving [92][93][94][95]. e appearance of Crippen-LogP, being a part of the F3 factor, is understandable since more polar molecules are usually more likely to form strong hydrogen bonds. Noteworthy, LogP, which is probably one of the most popular polarity parameters, was used for the Yalkowsky model [96,97], which confirms its usability in the HSP approach. Based on the F3 definition (Table 3), an interesting observation can be made; when CrippenLogP values are lower than about −2.34, the polarity is extremely high and so it does not affect the ability to form hydrogen bonds. is treatment of variables, associated with the determination of their scope of application, is characteristic for the MARSplines methodology. Similarly, as in case of other HSP models, autocorrelation descriptors play an important role.
ese molecular measures are related to the basic atomic properties such as Sanderson electronegativities (GATS2e), ionization potential (AATSC1i and AATSC2i), and van der Waals volume (ATSC1v). [71,72] offers a straightforward method for regression analysis, especially efficient in the case of large QSPR problems. In such cases, the complete exploration of all possible combinations of descriptors is prohibited by too large numbers of potential arrangements of the variables. In such situation, the genetic algorithm [98] offers a rational way of exploration of the most promising regions of QSPR solution space. Here, all QSPR models were built based on orthogonal sets of descriptors, that is denoted as run3 and run4, according to two different ways of orthogonalization (Scheme 1). Besides,  Figure 7. e developed models are of varying size, starting from 2 up to 20 parameters. However, QSPR models are fairly saturated starting from nine parameters. e most important message coming from Figure 7 is that the classical QSPR formalism leads to modes which are signi cantly less accurate compared to MARSplines. Even models with several parameters do not reach the quality of description o ered by the model de ned by Eq. (4). Inclusion of all MARSplines factors into the pool of descriptors leads to a serious improvement of linear regression approach but is still far from the best solution. It seems that, in the case of the d parameter, there is no gain in combination of MARSplines factors with PaDEL descriptors and searching for the solution via the QSPR approach. Similar conclusion can be drawn based on plots provided in Figure 8, documenting the accuracy of the models developed for computing the p parameter. However, since in this case, there is a serious discrepancy between the original MARSplines model and the reduced one, and some QSPR models exceed the accuracy of the latter. Only 20-parameter regression functions reach similar accuracy as the MARSplines model de ned by Eq. (5). Finally, similar analysis was performed for modeling of the HB parameter. is time a quite di erent set of data was obtained, as documented in Figure 9. Quite satisfying accuracy can be achieved even when 4 factors are  3.6. Applications of MARSplines Models. One of the most often used and direct applications of Hansen solubility parameters is the selection of appropriate solvent for solubilization or dispergation of di erent solids and materials including drugs [22][23][24][25][26], polymers [5][6][7][8][9], herbicides [7], pigments and dyes [3,18], and biomaterials [99]. It is typically done by computing HSP parameters based on a series of solubility measurements. Typically, 20-30 solvents are used for covering a broad range of Hansen parameters space [20,39,100,101]. Alternatively, mixtures of two solvents are prepared in such a way that the broad range of HSP is covered by solutions [102][103][104][105][106]. e formal procedure of solvents classi cation utilizes some threshold of solubility for distinguishing soluble cases from nonsoluble ones. Di erent criteria may be applied, but very often, the dissolution of the solid solute below 1 mg per 100 ml is considered as insoluble [107][108][109][110]. Hence, the solubility measurements can be reduced to the list of good and bad solvents, which resembles strong or weak interactions of the tested media with considered substance or material. e collection of three HSP parameters for all the solvents is plotted in a 3-dimensional space providing the location of solubility spheres. Additionally, empirical parameter dening the size of the sphere is computed for maximizing the classi cation for highest prediction rate of experimentally derived binary solubility data. is minimization protocol can be done using dedicated software, as, for example, HSPiP (Hansen solubility parameters in practice) [66]. However, it is also possible to take advantage of the denition of the contingency table or confusion matrix often used to describe the performance of a classi cation model. Here, this strategy was adopted for the solubility classication by using the straightforward procedure of maximizing the values of balanced accuracy (BACC (TP/ P + TN/N)/2), where TP and TN denote true positives and negatives, while P and N represent all positive and negative cases, respectively. is measure is one of the most commonly used ways of quanti cation of binary classi ers. It seems to be a natural adaptation of this terminology for rating the solubility as a mathematically coherent approach. Besides, no dedicated software is necessary, and any solverlike algorithms can be applied. e results provided below were computed using the evolutionary algorithm implemented in Excel.

Application of HSP Models to Polymers Dissolution.
e collection of the polymer solubility data was taken from the literature [39]. e experimentally measured data were originally classi ed on a scale described by the following quali ers: (1) soluble, (2) almost soluble, (3) strongly swollen and slight solubility, (4) swollen, (5) little swelling, and (6) no visible e ect. is list was converted into binary data by assuming polymer solubility only in the rst case and treating other situations as nonsoluble polymers. For the whole set of 33 polymers for which solubility was determined in 85 solvents, the classi cation was done by optimization of all three HSP, as well as R o for each polymer.
e solubility was predicted based on the classical formula of the distance in HSP space as follows: where the subscript P denotes the polymer and S the solvent. Four sets of solvent parameters were tested. ey corresponded to (a) our model provided this paper in Eqs. (4)-(6), (b) original set of parameters collected in Table A1 of "Hansen solubility parameters: a user's handbook. Appendix A" [39], (c) collection provided by Járvás et. al [45], and (d) HSP parameters from the green solvent set [111]. Following the Hansen concept, the relative energy di erence (RED) is de ned by the following ratio: where R 0 denotes the tolerance radius of a given polymer. In this approach, the material characterized by the model as RED > 1 is considered to be resistant to a solvent, whereas cases for which RED < 1 are regarded as soluble. During the procedure of solubility classi cation, the HSP values characterizing the solvent were kept intact and only the parameters for the polymer were adjusted for maximizing BAC for the whole set. e results of these computations are summarized in Table 4.
In all cases, the identi cation of true positive and true negative cases was higher than 90%. e misclassi cation of soluble pairs as insoluble ones and vice versa was always lower than 10%. Although the results of classification using our models are somewhat worse, the difference is not statistically significant, and all approaches lead to the same quality of polymers solubility classification.

Application of HSP Models to Drug-Like Solids
Dissolution. As the second type of external validation of the proposed model via application of the HSP procedure, the classification of solubility of drug-like solid substances was undertaken. Solubilities of benzoic acid, salicylic acid, paracetamol, and aspirin were taken from Stefanis and Panayiotou paper [25]. Again, maximizing of BACC was done by adopting HSP parameters. e results of the performed classification are collected in Table 5. In the third column of Table 5, there is provided the success rate obtained based on HSP values computed using the proposed model (Eqs. (4)-(6)), confronted with the success rate of the HSP approach adopted by Stefanis and Panayiotou [25] in the second column. It is worth mentioning that these authors used four parameters by splitting the hydrogen bonding part into donor and acceptor contributions. As it is documented in Table 5, the solubility predictions are almost of the same quality. In the case of benzoic acid and salicylic acid, a slightly lower quality of prediction was achieved. On the contrary, in the case of paracetamol, the success rate of the MARSplines model is higher. e predictions based on the HSP, presented in the Tables 4 and 5, are characterized by quite good accuracy. However, it should be taken into account that, there are also other approaches which were successfully used for solubility prediction, classification, and ranking such as linear solvation energy relationship (LSER) models including the Abraham equation [112,113] and the partial solvation parameters (PSPs) approach [114,115], conductor-like screening model for real solvents (COSMO-RS) [116][117][118], UNIFAC [119][120][121], and finally (modified separation of cohesive energy density) MOSCED methodology [122,123] which is an interesting extension of the HSP method. Nevertheless, HSP are, due to their universality, still very popular in solving many solubility and miscibility problems. In addition, it is also worth noting that, the proposed MARSplines model is characterized by a relatively high accuracy, although it was based only on the simplest 1D and 2D structural information retrieved from the SMILES code. erefore, the model can be extended with more complex molecular descriptors, such as quantum-chemical indices.

Conclusions
MARSplines has been found to be a very effective way of generating factors suitable for prediction of three Hansen solubility parameters. e most important factor is preserving the formal linear relationship typical for QSPR studies and extending the model with nonlinear contributions. ese come from the basis function definition and splitting the variable range into subdomains separated by knots values. Besides, factors used in the definition of the regression equations are constructed by multiplication of some number of basis functions that is referred to as the level of interactions. It is possible to formulate models with acceptable accuracy and user-defined complexity in terms of the number of basis functions and the level of interactions. It has been found that, for all three HSP parameters studied here (p, d, and HB), a promising precision was provided by quite simple models. e initial number of basis functions limited to 25 was found to be sufficient along with at most binary or ternary interaction levels. e internal validation of these models proved their applicability. e combination of descriptors with factors was also tested, but the obtained solutions were discouraging. Typical QSPR procedure relying on genetic algorithms for selecting the most adequate descriptors failed in finding models of the quality comparable with MARSplines. Only in the case of HB parameters, the result of the best QSPR models reached accuracy close to the MARSplines approach. Hence, it is not advised to combine traditional QSPR approaches by augmenting the pool of descriptors with factors derived in MARSplines. e observed supremacy of the latter in the case of HSP prediction suggests using it as a standalone procedure, especially since it offers a similar formal equation as traditional QSPR.
e application of the HSP models derived using MARSplines for typical solubility classification problems leads to essentially the same predictions as for the experimental sets of HSP.
is conclusion is a promising circumstance for further development of multiple linear regression models augmented with nonlinear contributions.

Data Availability
e data used to support the findings of this study are included within the article.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.