Machine learning in prediction of intrinsic aqueous solubility of drug‐like compounds: Generalization, complexity, or predictive ability?

We present a collection of publicly available intrinsic aqueous solubility data of 829 drug‐like compounds. Four different machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]) coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization were used for the prediction of solubility based on chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set with a root mean square error (RMSE) (test) of 0.70 log points, an R2(test) of 0.80, and 105 features. Taking into account the number of descriptors as well, an RF model achieves the best balance between complexity and predictive ability with an RMSE(test) of 0.72 log points, an R2(test) of 0.78, and with only 17 features. On a more aggressive test set (principal component analysis [PCA]‐based split), better generalization was observed for the RF model. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. Out of the two best learners, a consensus model was built exhibiting the best predictive ability and generalization with RMSE(test) of 0.67 log points and a R2(test) of 0.81.


| INTRODUCTION
Solubility is a critical topic in pharmaceutical development as it can be a limiting factor to drug absorption. 1 High attrition rate in drug development has been attributed to poor water solubility. 2 Predictive models such as quantitative structure-property relationships (QSPRs) can be useful tools to determine the solubility of a bioactive compound starting already in early development stages. Llinas and Avdeef 3 initiated the second solubility challenge in 2019 in order to engage the scientific community to address this challenging problem.
The first solubility challenge published by the same authors 4 demonstrated clear room for improvement in predicting solubility from (molecular) structural information. Palmer and Mitchell 5 concluded that there is still room for improvement with respect to predictive capabilities of QSPR rather than the lacking quality of data. Nevertheless, there is still a lack of public data available to develop quality models or at least cover a larger chemical space. In fact, it is the aforementioned solubility challenges that made quality data available. At the same time, pharmaceutical companies still own a large amount of unpublished data. Using such an unpublished dataset with experimental values of 38,841 compounds, Montanari et al. 6 tested multitask neural networks for solubility prediction. The authors built a model that yielded a cross-validated R 2 value of 0.59 (root mean square error [RMSE] not published). Such a data size for solubility is rare among publicly available datasets. Even though one cannot be sure about the quality of proprietary data, it might confirm Palmer's conclusion about limitations in modeling capabilities.
Many other research groups also dealt with the solubility prediction challenge,  attempting to predict both logS w (aqueous solubility; measured at a certain pH) and logS 0 (intrinsic solubility; solubility of a compound in its free acid or base form). 1 Key studies were summarized in Table S1. A comparison with previous studies is difficult because the authors often analyze the model quality in different manners (train, test, cross-validation, out-of-fold) and involved a multitude of model metrics. 31 Specifically, for the intrinsic solubility, literature values of the predictive performance of models on external test sets expressed by RMSE appear to vary between 0.7 and 1.05 log points 13,15,17,18,26,28,32 using a plethora of machine learning algorithms and datasets.
The most recent study from Avdeef 17 with the largest curated database known (6355 logS 0 entries) applied the random forests (RF) algorithm yielded RMSE(test) in a range of 0.75-1.05 and with an R 2 values between 0.66 and 0.83 across several models. These results outperform studies with the aforementioned proprietary databases, which signals the importance of careful data curation and chemical space consideration that Avdeef advocated. Within the aforementioned challenges, additional high-quality solubility data were published. With the availability of efficient and reliable machine learning methods as well as the ever increasing in computing power in HPC environments, more precise and faster learning models are available nowadays. Our goal in this work was to conduct a large-scale machine learning study to investigate how one can achieve robust predictions while retaining minimum model complexity.
For this purpose, we curated a novel intrinsic solubility dataset from literature sources. For the machine learning tasks, we used boosting and bagging ensemblers as well as partial least squares (PLS) and least absolute shrinkage and selection operator [LASSO] methods. The last two being established machine learning modes that are often neglected over seemingly more powerful ensemble regressors. 33 Consensus modeling was employed to build a final QSPR model. Finally, we discussed the use of permutation importance for a multistage feature selection, the relationship of metrics within data splits, and the relevancy of commonly used feature preprocessing/preselection and data splitting paradigms. Furthermore, we present a more challenging test set to test the models' extrapolation capabilities.

| Data collection and processing
We have collected aqueous solubility data from the following literature sources. 4,12,15,16,18,22,[34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52] The decision criteria on which literature to include for our study is initially based on the recommendations in the revisited solubility challenge. 3 Subsequently, we looked for additional literature sources where authors have included pH, which were measured between 22.5 C and 25 C temperature and used inert gases (argon, nitrogen) in their measurements. Most of the above-mentioned solubility data sources refer to the intrinsic aqueous solubility (logS 0 ), while others refer to the aqueous solubility (logS w ). For each compound, SMILES strings were retrieved from the name either through PubChem (https://pubchem.ncbi.nlm.nih.gov/), JChem (Marvin/JChem v20.9.0, ChemAxon, Budapest, Hungary), or via their CAS numbers (https://cactus.nci.nih.gov/translate/). SMILES strings were curated 53 and standardized to isomeric SMILES using the ChemAxon Standardizer (v18.28.0, ChemAxon, Budapest, Hungary) and the RDKit library. 54 We filtered compounds with the following properties: logP 55 in [À3. 6, 7.5], molecular weight larger than 88 g/mol, and structures with more than six heavy atoms. These ranges were determined according to the data published in the solubility challenges. 3 The obtained logS w values in the extracted data were converted to logS 0 based on their formal charges as suggested by Abraham and Le 46 and Avdeef. 56 Because we had multiple values for intrinsic solubility per molecule, we removed the duplicated values and averaged the rest. In total, out of the 829 compounds in the final data set, 446 had originally logS 0 values, whereas for the other 383 compounds, we have calculated the values from logS w .
The data preparation pipeline is depicted in Figure 1. We calculated and considered in modeling two types of predictive features: fingerprints (FPs) 57 and molecular descriptors (DPs) (calculated using DRAGON 6.0-Talete, Milano, IT). We chose FPs with a comparatively short radius of 3 bonds and large vector length of 5120 bits, to avoid bit collision as suggested by Landrum. 58 From the available $5000 DRAGON molecular DPs, only a few groups of DPs were selected based on chemical intuition, specifically, constitutional, ring, topological DPs, functional group counts, and molecular properties. All DPs with missing values were removed. Such a preselection procedure yielded a total of 317 molecular DPs. A combination of FPs and DPs (FPDS) was also evaluated (5444 features in total).

| Evaluated machine learning methods
For development of intrinsic solubility models of chemical based on their structure, four regression algorithms different in their paradigms were applied: (i) LASSO, 59 (ii) PLS, 60 (iii) RF, 61 and (iv) LightGBM. 62 All four are briefly summarized in the subsequent subsections.

| Least absolute shrinkage and selection operator
LASSO regression is a multivariate chemometric method, which involves the L 1 -penalty for regularization. 59 Given the multiple linear regression formulation with standardized features/predictors X (N, p) and response variable (N, 1) y, LASSO aims to solve the L 1 -penalized regression problem of finding a set of p model coefficients β = {β j } to minimize: where N is the total number of cases (compounds) in the training set and λ is the penalty term. In a linear regression model having constant term, the number of predictors (features, DPs) involved is equal to p À 1. Because of the form of the L 1 -penalty, LASSO inherently performs feature selection and shrinkage at the same time returning an extremely sparse coefficient matrix.

| Partial least squares
PLS regression is a chemometric method that aims to reduce the dimension of both the predictors (X-space) and the dependent variables (Y-space) by compressing them into latent variables (LVs). LVs are constructed in the direction of maximum correlation between X-and Y-spaces, where one wants to find the multidimensional direction in the X-space (predictive variables [N, p]) that explains the maximum multidimensional variance direction in the y (target variable [N, 1]). Readers are referred to Bro 60 for a more detailed overview.

| Random forests
The RF algorithm, conceptualized by Breiman, 63 creates a large collection of decorrelated decision trees by using bootstrapping aggregation. The final prediction results are thereby averaged from a multitude of decision tree regressors; this reduces the bias in the models, whereas variance can be controlled by carefully optimizing weak learner hyperparameters, such as tree depth. Besides their good performance, RF and other decision tree-based learners accept many feature representations and are associated with reduced preprocessing efforts, making them convenient for use in many applications, including manufacturing. Because trees in RF get trained in parallel, a significant advantage of RF is the speed when compared with boosting ensemblers.

| LightGBM
Light Gradient Boosting Machine (LGBM) 62 is a framework using the decision tree as a base algorithm.
LGBM uses the first-order derivative information when optimizing the loss function. The leaf growth strategy with depth limitation and multithread optimization in LGBM contributes to solve the excessive memory consumption with respect to other boosting-ensemble machine learning methods.
LGBM was selected to reduce the computational cost of calculations compared with other boosting ensemblers.

| Feature selection
In this work, we applied a multistage post hoc feature selection. The strategy is based on permutation importance 64 for eliminating features. 65 Using each of the trained models, the method permutes the values of individual features (one-by-one) to assess the relevance of the features with respect to the response vector (logS 0 ). The relative decrease in RMSE in a pretrained model caused by a permuted feature is considered a "weight." The permutation procedure was repeated 10 times for the feature matrix and averaged to a permutation importance vector. A cut-off value of 0.001 for the average weight was chosen. The feature elimination procedure was conducted in multiple stages. Models were trained, and then a set of features was eliminated either by having an average weight above the cut-off or the number of features used in the next stage were reduced to one third of the number of features, whichever was smaller. The models from each stage were included in the performance evaluation.

| Hyperparameter optimization
For hyperparameter optimization in machine learning, random and grid searches over hyperparameter spaces are used very often. 66 Because hyperparameter space can be large either by means of number of parameters or grid-points included, the procedure can suffer from large computational cost even with parallel computing. 67 Local optima in the parameter space are difficult to avoid if the grid is not dense enough with properly set parameter ranges. In this work, we applied Bayesian optimization (BO) 68 for hyperparameter optimization with RMSE (Validation) as a loss function. BO aims to construct a posterior distribution of functions (Gaussian process) that best describes the loss function. As the number of observations grows, the posterior distribution becomes narrower, and the algorithm becomes more certain of which regions in the parameter space are worth exploring and which are not. In the process of parameter optimization, the model is continuously trained, and the regression results obtained by each parameter combination are evaluated. Finally, the optimal parameter combination is obtained when a stopping criterion is reached (predefined number of iterations).

| Model training
To train the models, the datasets (logS 0 and the predictive sets) were split following two strategies: randomly and by means of diversity picking (a method of picking diverse molecules into subsets by means of their FP similarity). 69 For both splits, the external test set was set to 20% of the whole data set a priori (Table S2; previously published at Lovri c et al. 70 ), and the remaining 80% were further split by one of the two strategies into training (80%) and validation (20%) sets. We trained the models with (i) three options for the predictive features, namely, FP, DS, and a joint data set of FPDS; (ii) two splitting options: random or by diversity picking; (iii) four ML algorithms; (iv) with and without multistage feature selection; and (v) with and without feature preprocessing. The code for the preprocessing method (available at https://github.com/mariolovric/solubility) comprises the following sequential steps: removing features with any missing values, removal of correlated features (Pearson correlation > 0.85), separation of categorical features (from binary and continuous) and their conversion to binary features (based on binning to four "dummy" bins), and removal of low variance binary features (lower than 1% variance). The parameters of the ML models were tuned using BO for each of the named combinations. The available parameter space (upper and lower bounds) per algorithm can be found in the code repository. The models were trained on the training set and validated on the validation set during BO. RMSE computed out of the validation set was used as a loss function for BO. The optimization experiment ran for $48 h on a virtual machine with 24Â Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 30 GB of RAM. We also followed per iteration results on the external test set, to later on report the estimated generalization performance. Apart from LASSO, which has an internal regularization of the feature space, the models were trained iteratively with the permutation importance feature selection strategy multiple times, with each time transferring the feature list to the next model sequentially. Such modeling pipeline is depicted in Figure 2. Finally, the best models were chosen based on a ranking schema, which we believe it reflects an objective model evaluation. In Equation 2, the weights were chosen in such a manner that performance on the test is given the largest importance, followed by complexity expressed through the number of features and two terms representing generalization all combined in the average rank Rk M . . All ranks are sorted ascending.
where R features is the rank based on the total number of features involved in the model and R RMSE(test) is the rank of RMSE of the respective test set, whereas Δ val and Δ train are defined with Equations 3 and 4, respectively. Both terms account for the generalizability of the models.

| RESULTS AND DISCUSSION
In this work, we have compared four machine learning methods for prediction of aqueous solubility. Namely, PLS, LASSO, RFs, and LGBMs. PLS is an LV method in which predictors X are correlated to the dependent variable y by compressing both into LVs. The LVs are extracted by maximizing the variance in both X and y, as well as correlation between them. PLS is suitable for very intercorrelated data such as spectral information, has generally good generalization ability, and is able to deal with datasets with larger number of features than observations. LASSO is a method in which an L 1 -penalty is introduced for regularization with inherent feature selection, which makes it robust and also able to handle high-dimensional data. However, LASSO and PLS can be quite sensitive to outliers. RFs belong to a family of ensemble (nonlinear) methods where a series of weak learners are trained and aggregated with the aim of building strongly predictive models. The fourth algorithm is LGBM, a gradient boosting algorithm in which first-derivative information is used while computing the loss function for generation of the ensemble model. It possesses similar regression features as RFs. Finally, consensus models can be built out of the best regressors to further improve predictive ability, generalization, and robustness.

| Model optimization results
Detailed results of all trained models are summarized in This model, ranked by RMSE(test), was followed by five (second to sixth position in Table S3) RF models with some of them comprising as few as 16 features. The first PLS model appeared on the seventh place comprising 33 original features (10 latent features). The best LGBM model by means of RMSE(test) was ranked 15th comprising 47 features. Figure 3 depicts the contributions of the choice of predictors, algorithm, and the splitting method.
It can be observed that the FP-based solubility models have generally underperformed when compared with the models built out of molecular DPs or their combination. The models based on FP also exhibit a large spread in regard to RMSE(test). This outcome could have been expected because none of the four algorithms (PLS, LASSO, LGBM, and RF) creates metavariables (hidden layer abstract molecular representations) out of the FPs like deep neural networks do in the hidden layers that contribute to their predictive ability. 71 Furthermore, with the addition of FPDS, only marginal improvements can be observed.
LGBM shows a notably larger spread compared with other algorithms (Figure 4), which can be explained by evident overfitting on the train set and lower predictive ability on the test set.
F I G U R E 3 Distribution of testing set errors for the four evaluated machine learning algorithms in cases when two algorithms are used for training/test set partition. Differences between random train/test/validation split and diversity picking are depicted with green ascending and red descending line patterns, respectively. Mean values of the testing errors are depicted with green and red circles, whereas the outliers are depicted with green and red upwards-facing triangles, for random train/test/validation split and diversity picking, respectively F I G U R E 4 Generalization ability and robustness for all the models trained in this work. The RMSE(test) /RMSE(train) ratio depicted in this figure was grouped based on the method used (RF, PLS, LASSO, LGBM) for model development and three sets of predictive variables. Differences between random train/test/validation split and diversity picking are depicted with green ascending, and red descending line patterns, respectively. Mean values of the testing errors are depicted with green and red circles, whereas the outliers are depicted with green and red upwards-facing triangles, for random train/test/validation split and diversity picking, respectively Such performance decrease is not caused by the optimizer being stuck in local optima, as evident from Table S4 where optimal hyperparameters of LGBM vary considerably in each run.
Even though the LGBM is a powerful algorithm, it has a large variety of hyperparameters, and finding the right set of those can appear troublesome. The spread of RF tends to be smaller than LGBM, which can be explained by the bagging + decorrelation paradigms, which can help in avoiding any local optima during BO. In our previous work, we observed boosting ensemble methods also underperforming when compared with the bagging ensemblers. 33,72 Overall, the spreads per algorithm in Figure 4 are larger for the FP and FPDS predictive sets. This might be explained by randomness that FPs can introduce by having a train or test bit with all zero values impeding convergence.
Herein, we also evaluate the contribution of the data-splitting strategies. RMSE(val) values for models with datasets split via diversity picking can be as low as 0.53 (Table S3). Nevertheless, the highest ratios of RMSE(test/val) (above 1.2) are all originating from diversity-picked data splits. Diversity-picking leads to similar train and validation set that points to an overestimation of the model quality on any external test set. Therefore, the validation or other cross-validation metrics for models with diversity-picking-based splitting can point to lower generalization/robustness. Based on Δ train (Equation 4), LASSO is overall the best performer. PLS performs well in terms of both generalization metrics. RF models exhibited overfit but in a lesser extent than LGBM. Table 1 summarizes the 10 best models according to the Rk M metric only for random splits, because we have shown that diversity-picking can deviate the impression in generalization. Even though the LASSO model has the best score by RMSE(test), it has a high number of features, which is deteriorating its Rk M score. Because the LASSO algorithm is penalizing the coefficients, it can perform well with a high number of features if it sets the coefficients close to zero, which was the case with this model. The coefficients are in a range À0.38 to 0.29 with $42% coefficients being in the range from À0.01 to 0.01. A coefficient plot is given in Figure S1.
The Rk M metric was chosen in such a manner as to create a simple model by means of the number of features and a good result on the (external) test set but still taking into account generalization/robustness (see Equation 1).
By means of Rk M , a RF model using 17 features was ranked as best. The predictive ability of the two best models based on RMSE(test) and Rk M is depicted in Figure 5A,B, respectively. Out of the 10 best models by Rk M , four are RF and four are LGBM, the rest being LASSO. Interestingly, there are two LGBM models using two and three features for training. Even though not ranked as the best, they exhibited reasonable generalization. Eight out of these 10 models are not using preprocessing (part of the grid search), which shows that ensemble methods work well with the original data as preprocessing can remove valuable information. None of the best models was based on FPs. The models in Table 1 were based either on DPs or the combined with FPs. The R 2 values for the two best models are 0.80 (LASSO) and 0.78 (RF).
It is worth pointing that out of the two best models (LASSO and RFs), a consensus model was built outperforming all the evaluated models with RMSE(test) of 0.67 log points (R 2 of 0.81).

| Comparison of model scores
The aim of modeling is to develop a model by which we will be able to predict a modeled activity of an external (unseen) set of molecules. For this reason, it is of utmost importance to estimate generalization based on known performance of model obtained in training and validation procedures. We have therefore compared here the RMSE values for 158 models (separately for splitting methods), that is, the scores on train, validation, and test set. Additionally, we have calculated the average of RMSE(train) and RMSE(val) as RMSE(train, val). The results are shown in Table 2.
The comparison of results for randomly split data shows a correlation of 0.85 for train test and 0.87 for val test. The same comparison of models where train and val were diversity picked shows a somehow lower correlation of 0.77 for RMSE(train) À RMSE(val) with RMSE(val) À RMSE(test) being the same at 0.87. The reader is reminded here that the test set is a true external set that was split a priori. Only train À val splits were tested by the splitting strategies. Generally, a better prediction of intrinsic solubility for external set of compounds can be achieved if the model is validated (during model optimization and development) by means of cross-validation or a validation set in which the training set was split randomly into validation subsets. We propose that the diversity picking, another splitting algorithm applied in this study, can lead to overly optimistic results. A correlation of 0.87 in RMSE(train) À RMSE(val) within the diversity picking split supports this further, compared with the correlation of RMSE(train) À RMSE(val) in random split, which is at 0.71 (a lower correlation) meaning the distribution of the train and validation sets differs slightly. It is shown here that a drift in the distributions of train and validation sets can lead to a better generalization (on the true test set). Even more interesting is that in random splitting, the average of the train and validation by means of RMSE(train, val) delivers a good overview of the generalization of the model because they show a correlation as high as 0.92 to RMSE(test). Therefore, we suggest the use of RMSE(train, val) after they were randomly split for choosing models with good generalization on external unseen data.
F I G U R E 5 Predictive ability of the two best intrinsic solubility QSPR models from Table 1

| Feature importance
Careful analysis of the involved features for all the models in this study showed some interesting patterns (Table S3). First, the PLS models in general did not reduce to as few features during the feature selection as RF or LGBM. Second, LASSO mostly converged to subsets of 50-100 features. The multistage feature selection was not used in the case of LASSO as feature selection is inherent to this technique. Third, RF models have overall exhibited a reasonable model quality with a smaller number of features. This points to a fact that RF seems more efficient in removing features due to its bagging and decorrelation paradigms. The best model by means of Rk M was refitted with the resulting features and the resulting parameters. The retrained model was subjected to permutation importance, the results of which are depicted in Figure 6. Table 3 summarizes the descriptions of DPs involved in the best final RF model (Table 1), selected using the permutation importance strategy.
Detailed descriptions of all the DPs can be found in Todeschini and Consonni. 73 The analysis of the permutation importance of the DPs in Figure 6 shows that the best RF model is most sensitive to the order of values of the SCBO F I G U R E 6 Mean permutation importance for 1000 random resampling runs of the best model with 17 features (RF model from Table 1) T A B L E 3 Full names of descriptors selected into the final/best RF model from Table 1 Descriptor Description

| Physical interpretation of the relation between molecular DPs and aqueous solubility
The above analysis clearly established certain quantitative structure-aqueous solubility relationships for drug compounds. However, the question is: What is the physical interpretation of the correlating between the molecular (structural) parameters and the aqueous solubility? Here, we attempt to provide the physical interpretation of the top five important molecular DPs ( Figure 6).
1. SCBO: The sum of conventional bond orders in a molecule is related to the size (or molecular weight) of a compound, as well as to the total number of hydrogens in it. In general, larger (organic) molecules are less soluble in aqueous medium because it is more difficult for water solvent molecules to surround the larger molecules. Therefore, this strong correlation is expected. 2. D/Dtr 06: This descriptor describes the cyclic character of the evaluated molecules in terms of topological patterns that allow one to compare the cyclic complexity of structures, namely, the number of molecule cycles and the manner in which the cycles are connected. As the cyclic character also relates to the size of a solute, negative correlation with aqueous solubility is also anticipated. 3. AMR: Molecular refraction, a measure of the total polarizability, is often used as a solubility parameter, for example, Abraham solvation parameter model. Good correlations between solubility parameters and refractive indices have been reported. Hence, AMR is believed to be a good molecular descriptor of aqueous solubility. 4. MLOGP: The octanol-water partition coefficient (logP) is a ratio of the solubilities of a solute in a two-phase octanol/water system, which is an important index in measuring solubility. This is an obvious parameter correlates well with aqueous solubility of drug molecule. 5. TPSA (total): The polar surface area (surface sum over all polar atoms) represents potential area of a molecule that interacts with water molecule as a solvent. A large total polar surface area of a solute indicates stronger solvation in an aqueous medium. Thus, it is an important molecular descriptor to quantify the solute-solvent interaction of a drug molecule in aqueous environment.
Our result here confirms the valuable roles of constitutional, topological, geometrical, and electronic DPs to predict the aqueous solubility. Some of the selected DPs were utilized in many solubility prediction studies, for example, MLOGP (logP) as the most frequent by appearance in the literature, 16,17,[74][75][76][77] as well as other top DPs from Table 3 like the total number of carbon atoms (nC), 78 TPSA, 32,76,77 SCBO, 75 AMR (MR), 17,32 and the number of aromatic atoms (here nC ar ). 32

| Evaluation of the models' extrapolation capabilities on a more challenging test set
In order to test the extrapolation capabilities of the models, we have introduced a more challenging test set, that is, an extreme-case scenario. For this purpose, we have principal component analysis (PCA)-transformed FP data to LVs (principal components). The three components explain only $28.35% of total variance. This however does not affect the research issue at hand, which is the creation of a different train-test split which should reveal extrapolation capabilities of the winning models. Prior to PCA, low-variance FPs were removed (below 0.05). The centroid of the PCA space (three dimensions) was calculated as well as the Euclidean distances of all compounds to the centroid. The Euclidean distances to the centroid were sort and split at 80 percentiles of distance. All compounds below 80 percentiles were set as a new train set (PCA-train) and above as a new test set (PCA-test). The validation set is subtracted, so the number of compounds corresponds to the other splits (529 in PCA-train, 167 in PCA-test). The PCA-split space is depicted in Figure 7. The splits were subjected to the two winning models presented in Figure 5 (LASSO and RF); that is, the same hyperparameters and features were utilized, but the model was retrained on the PCA-train set and evaluated on the PCA-test set.
The LASSO model ( Figure 5A) had here an RMSE of 1.31 log-points on PCA-test, which is a large increase of error compared with the random split that results in a RMSE of 0.69. The RF model scores an RMSE of 0.89 on PCA-test compared with 0.72 obtained on the randomly split test set ( Figure 5B). It is interesting that the RF model performs better in such an extreme-case scenario. The LASSO model was chosen in the first place based on its test set performance, while the RF model was chosen based on Rk M , which is including also performance on train and validation sets and therefore present a better tool for estimating generalization. This confirms the appropriateness of using quality estimation by means of Rk M but also the importance of challenging the models with extreme-case scenarios such as this. It could be expected that similar descriptors utilized in models (which are presented in Figure 6) and a worse RMSE(test) of RF comparing to LASSO (see Table 1) would deteriorate the extrapolation capability, which was not the case since RF performed better in this more challenging task. This supports our discussion that the ensemblers can stabilize the models in case of descriptor redundancy.

| Limitations of the machine learning approaches for prediction of solubility
This study was designed as a multifactor evaluation for training machine learning models for the prediction of solubility. Some conventions like removal of collinear features were varied as a segment of a grid search to evaluate whether that might have an influence on model performance. The top models summarized in Table 1 have shown that eight of 10 models were those run without extensive preprocessing. Interestingly, the models with redundant features included fared better and had better predictive performance than the models that involved extensive preprocessing. This points to the fact that some machine learning models do profit from redundancy, at least those with intrinsic feature prioritization such as ensemble learners.
The best RF model in our evaluations has shown a slight bias on the test set. One potential cause of such bias can be attributed to the chosen metric for evaluation (RMSE-on the testing set in our case) because the research community still did not fully agree on the model quality metrics to be used. 31,79 This also makes comparison of research works and models published in literature challenging. Some biases can be avoided by using other or F I G U R E 7 PCA scores plot for the intrinsic solubility data. Molecular fingerprints are transformed onto three axes (PC1, PC2, PC3). Black X marker is the centroid of the space. Data points are colored by means of Euclidean distance from the centroid. Molecules that are in the 80 percentiles closest to the centroid are colored in green, whereas those further apart are colored in red weighted metrics. In this work, we limited ourselves to the RMSE, to avoid ambiguity in the decision-making process. Furthermore, bias can also be introduced by the experimental data itself. Literature suggests that the standard deviation in solubility laboratory measurements increases with decreasing intrinsic solubility. 17 Even though we limited ourselves to data where measurements are well described, there is a lack of coherence within data sources, which is described previously in literature. 17,56 Nevertheless, the RF model showed better extrapolation in the extreme-case scenario on the PCA-test set, which supports the use of proposed ranking methods and the stability of model regardless of the redundant features.
Even though we suggest two winners, the LASSO model by RMSE(test) and the RF model by Rk M , both are arbitrarily chosen criteria because (a) our ranking approach is a heuristic reasoned by weighting and (b) RMSE is chosen due to the its higher robustness comparing to the correlation coefficient, 17 but there are also other model quality metrics that can be used. It is important to note that the winning models have marginal improvements over follow-up models in the ranks. Presented results appear to converge to the values of RMSE(test) $0.7, which may suggest low structure-based information content involved in the calculated molecular features involved or certain limitations of quantitative structure-activity relationship (QSAR) predictive approaches that were used in this study. Even though the two predictive challenges 4,80 were 10 years apart and the second had an improved data quality, but very poor (or no) improvement has been achieved by means of the use of advanced machine learning models. Our approach with the ensemble models led us to be among the top performers (MLKC team) in the 2019 solubility challenge. 80 However, we acknowledge that limitations were reached for predictive capabilities QSAR models and the most popular chemical representations such as molecular DPs and FPs, which are also utilized in this work. Furthermore, we have curated our own data set to increase the size of the data, which can lead to error propagation because not all data sources have the same reliability.

| CONCLUSIONS
In this work, we tested the effects of multiple factors affecting machine learning outcomes in order to obtain the best prediction for intrinsic aqueous solubility. Besides the four regressors, namely, LASSO, RF, LightGBM, and PLS, we tested the effects of feature selection by means of permutation importance, the type and size of chemical representation (FP and molecular DPs), Bayesian optimization, and two data splitting options. The intrinsic solubility data used here is a novel collection of curated values and structures obtained from literature with 829 drug-like compounds. The best model by means of predictive performance on external test set is a LASSO regressor based on 105 features giving a RMSE of 0.7 (log units) in prediction on an external test set of organic compounds. Nevertheless, we proposed a ranking schema for choosing the best models based not solely on the measure's performance on a fixed test set but also by taking into account the number of features and the estimated generalization performance estimated on the training and validation sets. The rankings reveal a clear dominance of the RF algorithm because it can predict well with less features involved but has also a better performance on the more challenging PCA-split test set. Even though LightGBM is a powerful algorithm, it has a complex hyperparameter space, which is hard to optimize and was working in the overfitting regime in most cases. We show that there is no single criterion, data set, nor algorithm that can cover it all but rather a multiverse of possibilities and decisions to be embraced for building robust models with strong generalizability.