Uniting Cheminformatics and Chemical Theory To Predict the Intrinsic Aqueous Solubility of Crystalline Druglike Molecules

We present four models of solution free-energy prediction for druglike molecules utilizing cheminformatics descriptors and theoretically calculated thermodynamic values. We make predictions of solution free energy using physics-based theory alone and using machine learning/quantitative structure–property relationship (QSPR) models. We also develop machine learning models where the theoretical energies and cheminformatics descriptors are used as combined input. These models are used to predict solvation free energy. While direct theoretical calculation does not give accurate results in this approach, machine learning is able to give predictions with a root mean squared error (RMSE) of ∼1.1 log S units in a 10-fold cross-validation for our Drug-Like-Solubility-100 (DLS-100) dataset of 100 druglike molecules. We find that a model built using energy terms from our theoretical methodology as descriptors is marginally less predictive than one built on Chemistry Development Kit (CDK) descriptors. Combining both sets of descriptors allows a further but very modest improvement in the predictions. However, in some cases, this is a statistically significant enhancement. These results suggest that there is little complementarity between the chemical information provided by these two sets of descriptors, despite their different sources and methods of calculation. Our machine learning models are also able to predict the well-known Solubility Challenge dataset with an RMSE value of 0.9–1.0 log S units.


■ INTRODUCTION
Poor aqueous solubility remains a major cause of attrition in the drug development process. Despite theoretical developments, the solubility of druglike molecules still eludes truly quantitative computation. In recent work, 1 we have shown that accurate first-principles calculation is now becoming possible, provided that both the crystalline and solution phases are described by accurate theoretical models. Before this, energy terms from a computed thermodynamic cycle (see Figure 1) had been used as descriptors in a multilinear regression model for intrinsic solubility, delivering accuracy much better than from direct computation and comparable with the leading informatics approaches. 2 Since then, sophisticated machine learning techniques have been applied to many problems in the chemical sciences, while, as we have shown, 2,3 the accuracy of direct computation of hydration energies and solubilities has improved significantly. This led us to revisit the idea of hybrid informatics-theoretical models for solubility.
Cheminformatics methods have seen widespread use for property prediction, particularly in the pharmaceutical industry where they have been applied to; aqueous solubility, melting point, boiling point, log P (where P is the partition coefficient between octanol and water), binding affinities, and toxicology predictions. 4 Such methods are usually much quicker than pure chemical theory calculations, making high throughput virtual screening (HTVS) a possibility. Some methods have become accessible and easy-to-use web-based tools. 5 However, informatics methods suffer from the difficulty of decomposing the results into intuitive, physically meaningful understanding and cannot reflect the physical details of the system. To understand the underlying physics and chemistry, it is necessary to carry out an atomistic physics-based calculation.
Many chemical theory methods have been developed to specifically address one phase. The exact nature of the theory varies between these methods and the phase being studied. Crystal structures are often modeled using one of the lattice energy minimizing simulation methods, 6 plane-wave density functional theory (DFT) methods, 7 or periodic DFT using atom-centered basis sets. 8 The latter two methods come from a quantum-chemical standpoint. The results are often very good but have a high computational cost. The simulation methods often contain empirical parameters, which lowers the cost of these methods significantly, compared to DFT.
Popular solution-phase models include atomistic simulation methods based on molecular mechanics and dynamics, 9 quantum-mechanical implicit solvation methods (such as the polarizable continuum model (PCM)), 10 and "hybrid" models (such as the classical statistical mechanics-based reference interaction site model (RISM) 11 or hybrid quantum mechanics/molecular mechanics (QM/MM) methods 12 ). These methods have the inherent problem for industrial and drug discovery applications of being significantly more computationally intensive than cheminformatics models, which makes high-throughput computation infeasible. The closest thing to an exception among contemporary theoretical models may be 1D RISM, which requires only a few minutes of calculation time per compound and has been previously combined with cheminformatics to build the 1D-RISM/SDC method. 13 By combining lower levels of theoretical chemistry with cheminformatics, we hope to produce results in good agreement with experiment, but at a lower cost than higherlevel theoretical methods, and with higher accuracy than using cheminformatics descriptors alone.

■ METHODS
Molecules and Solubility. A set of 100 broadly druglike organic molecules was assembled with the prerequisites that each molecule should have an available crystal structure in the Cambridge Structural Database (CSD) 14 and a well-documented aqueous intrinsic solubility in the literature. Where possible, we prefer experimental solubilities obtained with the CheqSol method, 15 which has been shown to give reproducible results with only small random errors. The possibility of significant systematic errors between different experimental methodologies remains an issue and may possibly limit the accuracy with which modeling-based studies can be validated.
A total of 122 potentially useful CheqSol solubilities were obtained from the two Solubility Challenge papers 16 and downloaded from the Web. 17 While noting that several corrections had previously been made, we also corrected or disambiguated the following names: amitriptyline, 5-bromogramine, 5,5-diphenylhydantoin, 4-hydroxybenzoic acid, nortriptyline, and phenanthroline. Of the 122 compounds, 38 had corresponding crystal structures and could be included in our DLS-100 dataset. Where a choice existed, we selected the solubility and crystal structure of the least soluble and, therefore, most stable polymorph. For druglike compounds with known crystal structures, one further CheqSol solubility was available from Palmer et al. 2 and two from Narasimham et al. 18 We sourced solubility data for an additional 59 compounds from other experimental methods. 19 This gave us a total data set of 100 molecules.
The crystal structures were obtained using either the CrystalWeb 20 interface or the ConQuest 21 interface. Crystal structures were selected on the basis of stability, preferring the polymorph with the lowest literature solubility or the lowest lattice energy according to our computations where polymorph-specific experimental information was not available. We also applied the additional pragmatic selection criterion that the asymmetric unit cell should contain only one molecule. Once structures were identified, they were downloaded in either the SHELX format (.res) or CSD legacy format (.dat).
We chose to use Chemistry Development Kit (CDK) 22 molecular descriptors in this study, because these descriptors do not require proprietary software and are applicable to solubility prediction. 23 The CDK is an open source cheminformatics Java library. In order to use the CDK molecular descriptors, 22 we required each of our chemical structures in SMILES format. As noted by O'Boyle, 24 SMILES can be ambiguous. We thus decided to use one principal source for SMILES records, selecting the well-annotated database ChemSpider. 25 Since we are modeling intrinsic solubility, we wish to describe the neutral form of the druglike compound. This remains the case even if a protonated or deprotonated charged form dominates at neutral pH or across the pH range of the CheqSol (or other) experiment. To obtain a SMILES string for each molecule in the DLS-100 dataset, we wrote a Taverna workflow, 26 which uses web services provided by the ChemSpider database. 25,27 The workflow is freely available on the MyExperiment 28 repository at the following reference. 29 In five cases, we found the ChemSpider SMILES to correspond to an undesirable protonation state. Thus, we instead took the SMILES from the solubility challenge Web site 17 for cimetidine, pindolol, and phenobarbital, and from Wikipedia for griseofulvin 30 and glipizide. 31 Using the resulting 100 SMILES, we initially calculated all 268 available nonprotein CDK descriptors for each compound. We found that 145 of these descriptors were either undefined for 2D structures, or had the same value for all 100 compounds; their deletion left 123 remaining descriptors.
Crystal Structure and Gas-Phase Calculations. We took experimentally determined crystal structures of the compounds in our DLS-100 dataset as the initial input to our calculations. DMACRYS, 6 a periodic lattice simulation program, was used to perform the crystal structure minimizations and calculate vibrational contributions arising from the crystal. DMACRYS works in conjunction with the GDMA2 32 and Gausssian 09 (G09) programs. 33 The output of these calculations gives us the enthalpy of sublimation and crystal portion of the entropy of sublimation.
The selected crystal structures were input into DMACRYS, which was used to standardize the covalent bond lengths between hydrogens and heavy atoms, as the experimentally determined bond lengths are not accurate, because of the uncertainty in the hydrogen positions obtained by X-ray diffraction, before any calculations were run. Electrostatic interactions were calculated by multipole expansions 34 (obtained using GDMA2) of molecular charge distributions calculated at the MP2/6-31G** level using G09. Multipolar expansions up to hexadecapole were calculated. Intermolecular repulsion and dispersion were calculated by a Buckingham potential. 6,35 DMACRYS carries out a rigid-body minimization of the crystal structure, hence arriving at minimized lattice energies. This lattice energy can be converted to an enthalpy of sublimation by the following formula: Enthalpy of sublimation: where U latt is the lattice energy (energy of the crystal assuming the crystal is static and at 0 K relative to infinitely separated molecules) and the −2RT term arises from lattice vibrational energy. 2,36 The entropy of sublimation was calculated by: Entropy of sublimation: where S rot is the rotational entropy in the gas phase and S trans is the entropy of translation in the gas phase. S crys is the entropy of phonon vibrations within the crystal. The use of eq 3 makes these assumptions: (i) the rotational and translational entropy of the crystal is minimal, (ii) there is no change in electronic entropy between phases, and (iii) the intramolecular entropy is constant between the two phases. The crystal entropy is calculated by locating the frequencies of the phonon normal modes (lattice vibrations) at the gamma point. This is achieved using lattice dynamics, the results of which are used to calculate the Helmholtz free energy (see eqs S2 and S3 in the Supporting Information).
Gibbs free energy: The coordinates of a single molecule were extracted from the minimized lattice and used as input for the gaseous optimization with G09. Optimizations were carried out at the M06-2X and HF levels of theory with a 6-31G* basis set. The gas-phase entropy values were calculated from statistical thermodynamics in G09. Finally, ΔG sub is calculated from the enthalpy and entropy of sublimation.
Solution-Phase Calculations. All solution-phase calculations were carried out with G09 using the Self-Consistent Reaction Field (SCRF) protocol. We selected the SMD (Solvation Model based on Density) 37 implicit solvent model based on previous work. 1 Although RISM yielded moreaccurate absolute hydration energies than SMD in our recent work, 1 SMD generated a higher correlation coefficient against experimental results for hydration free energy prediction (R = 0.97 vs R = 0.93). Given the parametrized nature of our present model, correlation is more important than absolute agreement, and, hence, SMD is a suitable solvation model. Solution-phase calculations were carried out with the same methodologies as used in the gas-phase calculations, M06-2X/6-31G* and HF/6-31G*. Geometry optimization was again carried out, this time taking the gas-phase optimized structure as the starting point. The SMD model is a parametrized implicit solvation model. SMD solves for the free energy of solution (ΔG hyd ) as a sum of the electrostatic contributions and nonelectrostatic contributions. The electrostatic contributions are calculated by the solution of the nonhomogeneous Poisson equation; 23,37 this equation is a second-order differential equation linking the electrostatic potential, dielectric constant, and charge distribution. The nonelectrostatic contributions of cavitation, dispersion, and solvent structure are calculated as a sum of atomic and molecular contributions using parameters inherent to the SMD method. SMD has been shown to provide significant improvements over some other implicit solvent models for datasets containing molecules similar to those used in this study. 1 The hydration free energy is given by eq 4, Gibbs free energy of hydration: where E solution is the total energy of the system in the SMD solvation model and E gaseous is the total energy of the system in a vacuum. Scheme 1 represents the workflow for making such predictions. Standard States. Sublimation energies were calculated in the 1 atm standard state, which is the conventional standard for experimental sublimation energies to be quoted. However, solvation free energies are usually quoted in the Ben-Naim standard state of 1 mol/L. In this work, ΔG°corresponds to the 1 atm standard state, while ΔG* corresponds to the Ben-Naim 1 mol/L standard state (see Figure 2). 38 The difference between these two standard states is a constant energy value of 1.89 kcal/mol (7.91 kJ/mol). In this work, we calculate the sublimation free energy in the 1 atm standard state and then apply the correction to 1 mol/L in order to be consistent with the hydration free energy calculations; hence, ΔG solu is in the 1 mol/L standard state for all predictions in this work.
Theoretical Log S Prediction. Our final solution freeenergy prediction is then given as the sum of the predicted sublimation and hydration free energies: Gibbs free energy of solution: Therefore, we have two predictions for each molecule: The first method couples DMACRYS with G09 and the SMD solvation model at the HF/6-31G* level of theory. This model will be referred to as SMD(HF). The second method is DMACRYS coupled with G09 and the SMD solvation model at the M06-2X/6-31G* level of theory. This will be referred to as SMD(M06-2X).
For convenience of comparison with experimental values of solubility, we convert the free energy of solution to log S values, and all experimental solubility values to log S values: Here, R is the universal gas constant and T is the absolute temperature (in Kelvin). The conversion of experimental solubility to log S can be found in the Supporting Information (eq S7). Values for the full DLS-100 dataset, including SMILES and InChI, can be found in the Supporting Information (see zip file and dataset).
Informatics Models. To model the data, we use linear and machine learning regression models: partial least-squares regression, random forest and support vector regression. For reporting the predictive accuracy of these models, we averaged the RMSE of log S over a 10-fold cross-validation of the DLS-100 dataset. The cross-validation fulfils two purposes in this study: parameter optimization and evaluation of the accuracy of the models on unseen data. To ensure that each test fold of data is truly unseen, the parameter optimization is carried out in a separate layer of cross-validation within the training folds, as we will discuss below. In order to avoid overfitting, the data are preprocessed before building the predictive models.
Data Preprocessing. The use of multivariate data presents a danger of overfitting machine learning regression models; moreover, redundancy of attributes and correlation within the data add to the risk of reaching misleading conclusions. 39 To avoid such issues, we have used two normalization methods. One is the commonly used standardization method of variable scaling, equalizing the distributions of the variables by normalizing the mean and standard deviation of each column (variable). 40 The advantage of using this method is that it equalizes the prior importance of all the attributes. The second normalization method is principal component analysis (PCA), transforming the data into a smaller subspace where the new variables are uncorrelated with each other. 39 The PCA data transformation method deals with the redundancy of the data, and places emphasis on the variance of the data. The ability of each principal component to explain the data is measured according to the variance accounted for. Third, we have also fitted each model on the nonpreprocessed raw dataset, for comparison with the results of the two different scaling methods.
Machine Learning Regression Models. In this section, a summary of the regression models are presented; detailed explanations can be found in the Supporting Information.
Partial Least Squares Regression. The Partial Least Squares Regression (PLSR) model design is appropriate in a situation where there is no limit to the X variables or predictors, or where the sample size is small. Moreover, the PLSR model is also beneficial for analyzing strongly colinear and noisy data.
The goal of a PLSR model is to predict the output variable Y from the input variables X and to describe the structure of X. For this, PLSR finds a set of components from X that are relevant to Y; these components are known as latent variables. The intention of PLSR is to capture the information in the Xvariables that is most useful to predict Y. 41 A graphical representation is supplied in the Supporting Information Figure  S1(A).
Random Forest Regression. Random Forest (RF), a method for classification and regression analysis, has very attractive properties that have previously been found to improve the prediction of quantitative structure−activity relationship (QSAR) data. 42 An ensemble of many decision trees constitutes a random forest, and each is tree constructed using the Classification and Regression Trees (CART) algorithm. 43 The RF method is efficient in handling highdimensional data sets and is tolerant of redundant descriptors. Support Vector Regression. The main idea in Support Vector Regression (SVR) is to minimize the risk factor based on the structural risk minimization 44 from structure theory, to obtain a good generalization of the limited patterns available in the given data. First, the given data D are mapped onto a higher dimensional feature space, using the kernel function k(x i ,x j ) and then a predictive function is computed on a subset of support vectors. Here, we have used the radial basis kernel function (eq 7) to map the data onto a higher dimensional space. A graphical representation is supplied in the Supporting Information ( Figure S1(B)). SVR mapping on radial basis kernel function: Statistical Measures. To evaluate the performance of various machine learning models, we report two statistics: the root mean squared estimate (RMSE) and squared Pearson correlation coefficient R 2 (not to be confused with the coefficient of determination). 45 Formulas for these are given in the Supporting Information (eq S5). We have also assessed statistical significance using Menke Supporting Information (eq S6, Tables S3−S9 for R2, and Boxes S1−S3) for statistical significance). We also analyzed the variable importance for the RF method (see Table S17 in the Supporting Information). Variable importance was calculated in the CART program as implemented in R. 42b,48, 49 10-Fold Cross-Validation. In order to compute and compare the performance of the various regression models, we consider RMSE scores averaged over a 10-fold crossvalidation. 50 In the 10-fold cross-validation, the dataset is randomly split into 10 partitions, where the training set consists of 90% of the data and the test set consists of 10% of the data. A predictive regression model is fitted on the training set. The predictivity on the test fold is considered as an external measure to compute the accuracy of the fitted model. The entire process is repeated 10 times in order to cover the entire dataset, with each fold forming the test set on one occasion, and we record the average RMSE. The complete design of the workflow is represented in a flowchart (Scheme 2); similar workflows have been used for classification in other studies. 47, 51 The complete workflow of this analysis was written in R 52 using the CARET package; 53 all scripts are available in the Supporting Information.
In out-of-bag validation, one evaluates the performance of the model by separating training and test data through bootstrap sampling; this is convenient only for the RF method. It is not appropriate to compare RF out-of-bag predictions with other models such as PLS and SVR, which are not based on bootstrap sampling. So, we used 10-fold cross-validation to evaluate the performance of our various models.
10-Fold Cross-Validation for Parameter Tuning. For each model, we use 90% of the total data designated as the training set in order to find the optimum values for these parameters. We selected a range incorporating 20 different possible values for each model parameter, in order to select its best value. For each parameter, a further level of 10-fold crossvalidation is carried out in order to retrieve the RMSE of the models using each possible parameter value. Here, the training portion of 90% of the original data is further split into 10 new folds of 9%, with nine (81% of the original data) being used to build each model and one (9%) as an internal validation; this process of model building and internal validation is repeated to predict each of the 10 possible internal validation folds. This internal cross-validation step is repeated 20 times, once for each possible value of the parameter being assessed. Then, based on the value giving the lowest average RMSE score in the internal validation folds, the optimum parameter value is selected. Finally, the model is fitted on the complete training set of 90% of the original data using the selected parameter values.
Assessing the Final Models by 10-Fold Cross-Validation. The given 90%:10% split of the data into training and test sets was used to fit the final model for each fold of the main 10-fold cross-validation, once the optimum parameter values have been selected. The average RMSE and R 2 values over the 10 folds were considered in order to compare the usefulness of different descriptor sets and to evaluate the performance of the fitted models.
Dataset. The full DLS-100 dataset, with the experimental log S values, can be found as Supporting Information or downloaded from the Mitchell group web server (http:// chemistry.st-andrews.ac.uk/staff/jbom/group/Informatics_ Solubility.html; see the Supporting Information (csv_smiles_-SI.csv and Table S1)), which is consistent with the excellent suggestions from Walters. 54 The dataset includes CSD refcodes, Chemspider numbers, SMILES, experimental log S values and InChI for all molecules. The log S values in this work come from refs 2, 16, 18, and 19. Where possible, we have selected data obtained from the CheqSol method; where this was not available, we have selected reliable sources using different determination techniques. A good solubility prediction can be considered as a prediction of approximately the same error as that of the experiment. The experimental values have been shown in a number of previous papers to vary considerably. 55 Here, we consider the experimental accuracy limit to be between 0.6 and 1 log S unit (where 1 log S unit represents 5.7 kJ/mol at 298 K). Previous work has reported the experimental error in solubility prediction to be as great as 1.5 log S units and, on average, the error to be at least 0.6 log S units. 56 In 2006, Dearden 55a noted, as was later reiterated in the Solubility Challenge, that models with RMSE predictions of <0.5 log S units are likely to be overfitted. 16b,55a For a prediction to be useful, it must have an RMSE within the standard deviation of the experimental data; otherwise, a trivial prediction using the mean of the experimental data is a more accurate prediction of the log S value. 1 For the DLS-100 dataset, the experimental standard deviation is 1.71 log S units.

■ RESULTS AND DISCUSSION
We have compiled four sets of results for our DLS-100 dataset. First, a purely theoretical prediction, in which no machine learning is used and where predictions are made using only physics-based calculations. Second, theoretical energies are used as the sole descriptors in machine learning models. Third, cheminformatics descriptors, calculated using the CDK, are used as the sole input to machine learning methods. Finally, cheminformatics descriptors and theoretically computed energies are combined as input to machine learning methods. For each of these methods, we present the results and discussion, with comparison between the methods made on the basis of RMSE and R 2 (correlation coefficients for cheminformatics and combined models can be found in the Supporting Information (Tables S3−S9); RMSE values can be found in the Supporting Information (Tables S10−S16)). In addition to these results, we have replicated the solubility challenge using 2D molecular descriptors alone.
Theoretical Predictions. The theoretical methodologies described earlier utilize a thermodynamic cycle to access the free energy of solution. Table 1 shows the R 2 correlation coefficient and the RMSE for the predictions made by these methods. Chart 1 shows the linear fit to the data from the SMD(HF) method, which has the lower RMSE and the higher R 2 correlation coefficient of the two purely theoretical methods.
Chart 1 shows that the data are poorly explained by a linear model. The RMSE for the SMD(HF) method is nearly three times the suggested criterion of 1 log S unit of error. The situation is even worse for the SMD(M06-2X) method for it is possible to explain the underlying structure of these data using a general model, based on the predicted log S values, such a model will be inherently nonlinear. Compared with our previous work, 1 in which theoretical models provided a good prediction of log S, our theoretical methodology here differs only marginally, in the use of MP2 multipoles, and still produces good results (see Supporting Information (Chart S1 and Table S2)) for the same 25 molecules in this work (dataset DLS-25). The predictions for the additional 75 molecules alone show worse predictions than for the full 100-molecule set presented above (see Charts S2 and S3 in the Supporting Information). The additional 75 molecules therefore appear to form a more difficult dataset to predict. It is likely that improved results can be obtained from purely theoretical calculations, if some of the approximations made here are improved; for example, improved modeling of the solvated phase to more accurately describe the solvent and its effects on the solute could increase accuracy. Also, we note that the intramolecular degrees of freedom are neglected in the DMACRYS calculations, and further assumptions are made by using eqs 2 and 3 in the Methods section.
We subsequently applied machine learning methods to the theoretical energies in order to carry out nonlinear regression analysis. The average RMSE scores over 10-fold crossvalidation (see the Methods section for details) is represented as two-dimensional (2-D) column charts (see Charts 2 and 6). Different grayscale column bars represent the different machine learning methods used in this study. The standard deviation is shown as an error bar (black line).
Theoretical Energies as Sole Descriptors in Machine Learning. The use of the calculated energies as descriptors in the machine learning models yields considerably improved results, compared to those from the predictions made without machine learning. The results now, while still missing the 1 log S unit error criterion, do make useful predictions in which the RMSE is within the standard deviation of the experimental data (1.71 log S units). The RF and SVR models produce notably better results than PLS. Charts 2 and 3 show that the method minimizing the RMSE (1.21 log S units) is RF with HF when scaled with PCA.
Cheminformatics Descriptors as the Sole Input to Machine Learning. An additional point of interest is that the chemical descriptors alone using RF or SVR can provide a marginally better prediction of log S than the machine learning methods with only the energies as descriptors. In particular, we noticed that fitting the RF model on data that are scaled to a given mean and standard deviation produces a statistically significant improvement in its prediction with cheminformatics descriptors alone rather than theoretical energies (see the Supporting Information (Boxes S1−S3)). In all other cases, the changes are not significant. This suggests that slightly more useful information about the molecules' log S values is conveyed by the cheminformatics descriptors than by the theoretical energies alone (see Chart 4).
Theoretical Energies and Cheminformatics Descriptors as Input to Machine Learning. When the descriptors and energies are combined as input for the machine learning methods, we obtain results that are generally only very slightly better than those obtained from cheminformatics descriptors alone. This implies that the theoretical energies contain very little extra useful information not already present in the descriptors. The joint results do present a statistically significant improvement for PLS and RF, once scaled by the mean/ standard deviation, compared to those for the theoretical energies alone. In light of this, and given that the descriptors alone produce a marginally improved result compared to chemical theory, it is fair to say the cheminformatics descriptors are seen to contain a modest amount of additional information not incorporated in the theoretical energy terms. This suggests that the 123 descriptors of the cheminformatics descriptors and the 10 theoretical energy descriptors convey similar information, with only a small amount of additional information being conveyed by adding the descriptors to the energies and almost no information gained by adding the energies to the set of descriptors. We can conclude that these two sets of features are not generally complementary. Interestingly, the best result in terms of RMSE is from the descriptors with the M06-2X energies, which, on their own, produced the worse of the two pure theory results in this work (see Charts 5 and 6). The RF model performs particularly well over all descriptor sets, even without any type of scaling, the best RMSE result being only 0.13 units outside the 1 log S unit target. The best single prediction, in terms of the RMSE, was made by the PLS model, using descriptors and the M06-2X energies scaled by the standard deviation and the mean, with an RMSE of 1.11 ± 0.04 log S units. All of these methods make predictions inside the standard deviation of the experimental data; therefore, all of the predictions are useful. We also note that the RF model shows small but statistically significant improvements with all scaling methods (using the theoretical energies and cheminformatics descriptors combined) when compared to some models trained on the theoretical energies only (see Supporting Information (Boxes S1−S3)). This is the only model to show such improvements with all scaling methods in the present work.
We analyzed the relative variable importance (see Table S17 in the Supporting Information) and found that X log P (from ref 57) was consistently rated as the most important feature. X log P is a computed estimate of the base-10 logarithm of the octanol:water partition coefficient (the ratio of concentrations of solute solvated in the two different solvents). This has been seen in many previous studies and is not so surprising given that it provides information specifically about the solvated phase. 4,56 X log P uses an atom additive model for the prediction of log P. In the Supporting Information, we include tables (Table S17 in the Supporting Information) displaying the 10 most important descriptors; here, we will briefly comment upon these. We find Kier and Hall's χ path and chain indices 58 to be of importance; these quantify the degree of bonding to heavy atoms within a given path or chain length. In addition, the Moreau−Broto autocorrelation, 59 which describes the charge and mass distribution along a given path length, is found in the top 10. Finally, we also note Randic's weighted path descriptors, 60 which are used to account for molecular branching. Once the theoretical energies are added to the descriptor set, the free energies of hydration and solution are ranked in the top 10, along with the theoretical log S prediction. Explanations of the molecular descriptors used in this work can be found in ref 61.
Chemically, we can see logic in the most important descriptors. One may expect that molecular branching would play an important role, because it gives information on the extent and flexibility of the molecule, hence contributing some entropic information. Coupling this descriptor with the Kier Hall descriptors, information can be acquired on the composition of such chains, in terms of heavy atoms. The autocorrelation descriptor provides charge and mass distribution information. Again, here, information is imparted concerning the distribution of heavy atoms and electronic factors. For example, the degree of charge separation across a molecule and the localization of charges are important factors in determining particularly enthalpic but also entropic contributions. The theoretical energies in the top 10 are all closely related quantities; it is not surprising that the (purely theoretical) prediction of log S is found in the top 10: since this is the quantity we are trying to predict, it is expected to provide sufficient information to the model to be found in the top 10. The free energies of solution and hydration provide direct information from electronic structure theory and statistical thermodynamics on the interactions of a given molecule, in a given conformation, within its environment, and on the energetics of phase transitions.
As a benchmark, we also present our method's predictions of the solubility challenge set based solely on cheminformatics descriptors (see Table 2). As suitable crystal structures are not available for all molecules in the solubility challenge, we could not calculate the theoretical energies. Tables 2 and 3, and Chart 7, demonstrate that our method can make predictions for the solubility challenge dataset within the coveted 1 log S unit RMSE error and, in fact, makes predictions that are consistent with some commercially available methods and deep-learning methods. A recent publication 56 reported RMSE scores of 0.95 log S units 56 for the commercially available package MLR-SC 62 and 0.90 log S units for a deep-learning method. 56 However, these results are not directly comparable with ours, for two reasons. First, our results have been calculated for a 10-fold cross-validation and for the canonical training:test split (see Tables 2 and 3). Second, the deep-learning result (RMSE = 0.90) given by Lusci et al. 56 is contingent on correcting eight putative errors in the CheqSol solubility data, the most substantial of which is for indomethacin, a compound that has been shown to hydrolyze under alkaline conditions. 63 While we have corrected names and SMILES for the solubility challenge set, we have not adjusted any solubility values therein. It is also reasonable to suggest that, using the solubility challenge set as a benchmark, our 100-molecule set could be considered as a "difficult set", given the improved prediction offered by our method when the solubility challenge set is used instead.

■ CONCLUSIONS
Our current work shows that accurate solution free energies are not calculable via the simple theoretical procedure that we present here. A significant portion of the important physics in the solution process is not captured using the approximate methodologies that we utilize in this work. This reaffirms that, currently, QSPR methodologies are the most-accurate and time-efficient methods for accurate solution free energy predictions. In addition, we show that state-of-the-art machine learning methods, with a modest number of cheminformatics descriptors, are capable of making solution free-energy predictions that are consistent with those of commercially available programs and newer deep-learning approaches. Here, theoretical energies and cheminformatics descriptors are generally shown to not be complementary for such predictions. Since both sets of descriptors (theoretical energies and cheminformatics descriptors) produce a similar level of accuracy when used alone in the machine learning methods, and little improvement is seen when they are combined, we can conclude that the information conveyed is of a similar nature and that the theoretical energies are, for this reason, a more efficient form of information storage, as 10 descriptors contain equivalent information to 123 molecular descriptors. However, in terms of time, the molecular descriptors are much less expensive to calculate and their use is therefore more timeefficient. Additionally, we note that the RF method has produced promising predictions in this work, with relatively low RMSE. This method has consistently produced good results and would be our recommended method to make solubility predictions.

* S Supporting Information
Informatics_Solubilty_datasets_and_scripts.zip, including R codes, Bash scripts, Python scripts, macro (.xlsb), DLS-100.csv and Solubility_Challenge_dataset.xlsx. DLS-100.csv contains experimental log S values, references, SMILES, sources of smiles, CSD refcodes, molecules names, InChI and Chemspider numbers. SI_document.pdf: Structure data, 2D images of the molecular structures, experimental log S values, CSD refcodes, R 2 , statistical significance, variable importance. This material is available free of charge via the Internet at http://pubs.acs.org. All scripts and datasets used in this work are available for download from the Mitchell Group web server (http://chemistry.st-andrews.ac.uk/staff/jbom/group/ Informatics_Solubility.html, as well as in the Supporting Information.

Author Contributions
The manuscript was written through contributions of all authors. All authors have given