Is Experimental Data Quality the Limiting Factor in Predicting the Aqueous Solubility of Druglike Molecules?

We report the results of testing quantitative structure−property relationships (QSPR) that were trained upon the same druglike molecules but two different sets of solubility data: (i) data extracted from several different sources from the published literature, for which the experimental uncertainty is estimated to be 0.6−0.7 log S units (referred to mol/L); (ii) data measured by a single accurate experimental method (CheqSol), for which experimental uncertainty is typically <0.05 log S units. Contrary to what might be expected, the models derived from the CheqSol experimental data are not more accurate than those derived from the “noisy” literature data. The results suggest that, at the present time, it is the deficiency of QSPR methods (algorithms and/or descriptor sets), and not, as is commonly quoted, the uncertainty in the experimental measurements, which is the limiting factor in accurately predicting aqueous solubility for pharmaceutical molecules.


■ INTRODUCTION
Interest in the prediction of solubility by quantitative structure−property relationships (QSPRs) has risen dramatically in recent years. 1−5 Currently, the state-of-the-art tool allows the prediction of solubility with root-mean-square errors (RMSEs) of approximately 0.3−0.4 log units for simple organic molecules and 0.7−1.0 log units for drug molecules. 6 One frequently cited reason for the difficulty in predicting solubility for drug molecules is that published methods are derived from data taken from multiple sources from the literature, for which RMSEs in experimental data have been estimated to be 0.6−0.7 logS units. 6 The implicit assumption is that if existing models could be retrained and tested upon more accurate data, then the predictive error would decrease. Until recently, it has, however, not been possible to test this hypothesis, due to the absence of a definitive "gold-standard" data set containing accurate solubility data for fully characterized drug molecules.
There are many different definitions of aqueous solubility in common use in the published literature, but the majority of QSPR models have focused on the prediction of intrinsic aqueous solubility, which is the property we consider here. The intrinsic aqueous solubility of an ionizable molecule is defined as the concentration of the unionized molecule in saturated aqueous solution at thermodynamic equilibrium at a given temperature. 7,8 It is used to calculate dissolution rate and pH-dependent solubility in models such as the Noyes−Whitney equation 9 and the Henderson−Hasselbalch equation, 10,11 respectively. There has been great interest in predicting the intrinsic aqueous solubility of bioactive molecules in the biochemical sciences because it is a key determinant in the bioavailability of novel pharmaceuticals and the environmental fate of potential pollutants. 12−14 Recently, the intrinsic aqueous solubilities of 132 drug molecules have been measured by the CheqSol method. 15,16 The data are highly reproducible and standard errors for ten repeat assays (i.e., the random errors between repititions of the same experiment) are typically around 0.05 log units (referred to mol/L). CheqSol measurements of solubility have been shown to be very reliable, with very good agreement reported for experiments carried out in different laboratories in different countries and continents. The results also agree very well with carefully performed shake-flask experiments carried out under the correct conditions (which are not always the conditions used for data reported in the published literature). The CheqSol data were originally published as part of a blind challenge to predict solubility. 15,16 Although this exercise provided a useful benchmark for the field, the results of the blind challenge were difficult to interpret in the scope of the questions addressed here since entrants were not asked to provide details of the methods they used or of any additional experimental data they employed in making the predictions.
Here, to test the hypothesis that experimental data quality is the limiting factor in predicting aqueous solubility, we develop QSPRs from the CheqSol solubility data set and compare these with models derived from literature solubility data for the same molecules. The models are derived by Random Forest regression, which has a number of attributes that make it useful for this purpose. First, a Random Forest model trained on the molecular descriptors used here has been shown to perform well in comparison to other published methods for the prediction of solubility. 17,18 Therefore, the models may be considered to be representative of QSPR models in the literature. Second, Random Forest is a well-defined algorithm for QSPR model building that does not overfit, which means that its use eliminates the influence that inconsistent modeling procedures would have on the experiment. To provide a comparison to the Random Forest model, we also implement a regression model against experimental melting point and logP, which may be considered to be a parametrized model based on the general solubility equation. 19 The results are important as a benchmark of the prediction of solubility by QSPR methods but may also be considered to be illustrative of the effect that noise (in the dependent variable) has on QSPR models and what this means when attributing physical significance to QSPR variables. The work is also timely given the recent development of molecular simulation methods to predict the solubility of druglike molecules, 20,21 which, although currently more computationally expensive and less accurate than QSPR models, may in the future offer an alternative to QSPR models in some applications, depending on how both fields develop. 22−24 ■ METHODS Data Sets. For each of the 85 drug molecules in the data set, two solubility values were obtained. First, the thermodynamic solubility of the nonionized form (intrinsic solubility) at 298 K was determined by the CheqSol method ("CheqSol Data set"). Second, an experimental intrinsic solubility value at 298 K was taken from the literature ("Literature Data set").
CheqSol Data Set. The intrinsic aqueous solubilities of 132 molecules were published by Llinas et al. as part of a recent blind challenge to predict solubility. 15,16 In the current work, we consider only a subset of this data because solubility data for only 85 of these molecules were found to be available from other sources in the published literature. The intrinsic aqueous solubility was measured for all 85 compounds by Llinas et al. using the CheqSol method. 16,25,26 For indomethacin, which was observed to hydrolyze in the original assay, we took the revised solubility data point reported by Comer et al. 26 In each CheqSol assay, intrinsic solubility was measured 10 or more times. In addition, the complete CheqSol assay was repeated multiple times starting from separate samples in different vials. The solubility or standard error of each molecule in the data set is reported as a statistical result across all separate CheqSol assays ( Table 1). The standard errors for the measured intrinsic solubilities were typically less than 0.05 logS (referred to mol/ L). For five molecules in the data set, a change in polymorphic form was reported to occur during the solubility assay. In each case, the final polymorphic form was reported to be stable to repeated cycling between sub-and supersaturated states, and the solubility of this polymorph was used in the QSPR analyses. The ratios of the solubilities of the observed polymorphs are discussed in the results section.
Literature Solubility Data. Data for intrinsic solubility in water at 25°C were obtained from the literature for each of the 85 molecules. For 26 of the 85 molecules, more than one value for solubility was found in the literature and an arithmetic average was taken. Of these molecules, three had a standard deviation between 0.5 and 1 log solubility units (diethylstilbestrol, indomethacin, and propanolol) and the remainder had standard deviations less than 0.5 log solubility units. Solubility data and references are provided in Table 1.
The literature solubility data set was compiled from 123 experimental solubility values taken from 12 sources from the literature. In general, the data are taken from well-known QSPR solubility data sets; the majority comes from papers on the prediction of solubility by Bergstrom et al. (35%), 27 Wassvik et al. (11%), 28 and Rytting (28%). 29 The remainder of the data are taken from other QSPR papers and well-known databases. Accurate measurement of intrinsic aqueous solubility requires that thermodynamic equilibrium is established and several factors are controlled, including purity of the solute and solvent, temperature, physical form of the precipitate, and solution pH and ionization state. Although care was taken to select intrinsic aqueous solubility from reliable sources in the literature (and those that are representative of good QSPR data sets), mistakes in experimental methodology and reporting of data have undoubtedly introduced some unidentified errors, which may mean that some of the literature data do not correspond perfectly to intrinsic aqueous solubility (they may be kinetic or total aqueous solubility). The problems with literature solubility are widely known and have been well discussed by many other authors. 30,31 Simple statistics of the distribution of both the literature and the new experimental data set are provided in Table 1. The literature and CheqSol solubility data are plotted against each other in Figure 1.
Melting Point Data and logP Data. For each molecule, experimental melting point and logP data were also obtained from the literature. All data and references are given in Table 1.
QSPR Models. Two models were built for both new experimental and literature solubility data: (i) a Random Forest model using all 2D and 3D descriptors; (ii) a multilinear regression equation using two variables: experimental melting point and experimental logP. The Random Forest model is similar to that used in a previous study on the prediction of solubility. 17 The multiple linear regression equation is a reparameterization of the general solubility equation and is included for comparison.
For each molecule, molecular structure files were taken from PubChem and were checked using SciFinder. A single low energy molecular conformer was selected by a low-mode conformation search using the MMFF94s force field with a Generalized Born Surface Area (GB/SA) model for water as solvent (as implemented in MacroModel). 51 Both 2D and 3D molecular descriptors were calculated from the lowest energy conformer using the Molecular Operating Environment (MOE) software. 52 The MOE software provides a method for predicting logS, which when compared to our new experimental solubility data for all 85 molecules gave a rather unsatisfactory, r 2 = 0.27, RMSE = 1.05 logS units, and bias = −0.36 logS units (where S is referred to units of mol/L). These predicted solubility values were excluded from further analysis and were not used in model building.
The 2D descriptors included calculated physical properties (logP, molar refractivity), charged surface properties (from Gasteiger−Marsili PEOE charge distributions on VDW surfaces), constitutional descriptors (counts of atoms and functional groups), connectivity and topological indices (including the chi, Kier−Hall, kappa, Wiener, and Balaban indices), and hydrogen bonding propensities (numbers of hydrogen bond donors and acceptors). The 3D descriptors included energy terms (total potential energy and contributions of angle bend, electrostatic, out of plane, and solvation terms to the molecular mechanics force-field energy), molecular shape descriptors (water accessible surface areas), volume descriptors, and surface area descriptors. These descriptors have previously been shown to be successful for the prediction of solubility. 53 All of the descriptors used in this work are listed in the Supporting Information. Each model was trained upon a subset of the data set (n = 60) and then validated by both crossvalidation on the training data set and by prediction of the remaining 25 molecules. To reduce the influence of the choice of training and test split, this procedure was repeated 20 times (i.e., 20 different random partitions; a method sometimes referred to as Monte Carlo cross-validation), and statistically averaged results are reported. All comparisons were made on a like-for-like basis, i.e., the same randomly selected training and test sets and the same cross-validation folds were used for each pair of Random Forest or multilinear regression models trained upon literature solubility and new experimental data. Thus, the results for row 1 in Table 4 (or Table 7) and row 1 in Table 5 (or Table 8) are directly comparable, and likewise for rows 2, 3, 4, etc. Three statistics are reported to assess the accuracy of each regression model: (1) where index i runs through the set of N selected molecules, and y i and y exp i are the calculated and the experimental values, respectively, for molecule i for the given property (i.e., log 10 S). A parentheses nomenclature is adopted to indicate whether the results refer to fit-to-the-training data (tr), 10-fold crossvalidation (cv), out-of-bag validation (oob), or prediction of the test set (te).
Random Forest is robust to overfitting and can be used without optimization of the training parameters, 54 which allows the results for the two data sets to be compared without concern that the results are influenced by modeling errors. Each Random Forest was trained with the parameters of ntree = 500, nodesize = 5, and mtry = (total number of descriptors)/3, which have been used successfully in previous QSPR models to predict solubility. 17,18 Comparison between Experimental and Literature Solubility Data. Figure 1 shows the correlation diagram between the log solubility data taken from the literature against the new experimental data (r 2 = 0.69, RMSE = 0.68 logS units, σ = 0.66 logS units, and bias = −0.18 logS units). Table 2  Accurate measurement of intrinsic aqueous solubility requires the careful control of many factors, including purity of the solute and solvent, temperature, physical form of the precipitate, and solution pH and ionization state. In Figure 1, the predominance of molecules whose literature solubility values are higher than our measured intrinsic values suggests that some of these literature data points may correspond more closely to total aqueous solubilities than intrinsic solubilities (i.e., the ionized form of the molecule may contribute to the measured solubility). The literature solubility values that are lower than the measured intrinsic values may indicate that in the former a different crystalline form (e.g., a hydrate) was present at thermodynamic equilibrium. The inaccuracies in literature solubility caused by both experimental and reporting errors have been well discussed previously. 8,30 The CheqSol method is designed to measure specifically the intrinsic aqueous solubility. The crystalline form of the precipitates identified in the CheqSol experiments were characterized by thermogravimetric analysis, differential scanning calorimetry, and powder and single-crystal X-ray diffraction.
Polymorphism. The cycling between subsaturated, saturated and supersaturated states that is inherent to the CheqSol method was reported to cause a polymorphic change for five molecules during the solubility assay. 15,25 This polymorphic change was first evident as a change in solubility and was verified by repeating the assay in order to isolate and fingerprint the precipitate by powder X-ray diffraction (PXRD). The final polymorphic form was reported to be stable to repeated cycling between saturated and subsaturated states in each experiment. In Table 3, the ratios of the average solubilities of each observed polymorph to that of the least soluble polymorph for that molecule are reported. For each molecule, the solubility is observed to decrease as consecutive polymorphs are formed, which is in agreement with Ostwald's "Law of Stages" that states that metastable polymorphs are often observed to form prior to more thermodynamically stable ones. 55 The largest ratio of the solubility of observed polymorphs is for diflunisal, for which form I is 89 times more soluble than form IV, a surprisingly large difference in solubility. Unfortunately, although the result for diflunisal was reproducible, it was not possible to solve the three-dimensional crystal structure from the PXRD patterns, which prevented further analysis of the structural differences between forms I and IV. The remaining four molecules in the data set have an average polymorph ratio of 3.45, which provides further evidence that diflunisal is an uncommon example. It has previously been reported, based on a statistical analysis of known crystal structures and existing solubility data, that the average difference between the solubility of polymorphs of druglike molecules is approximately 2-fold. 56 Here the observed average polymorph solubility ratio is 3.45; this may be an artifact caused by the use of a small data set (only five molecules with solubilities for more than one polymorph).

■ RESULTS
We have used two different methods, (i) Random Forest regression with a combination of 2D and 3D molecular descriptors and (ii) multilinear regression using only experimental melting point and logP values, to model the 85 molecule data set for both literature solubility data and new solubility data.  The results for each of the 20 Random Forest regression models derived from different random partitions of the data set are presented for literature solubility data in Table 4 and for new experimental data in Table 5. The statistically averaged results are presented in Table 6.      (Tables 5 and 6). By comparison to the Random Forest models, the multilinear regression models against experimental values of melting point and logP were less accurate for the prediction of solubility. The    Tables 7 and 8, respectively. The statistically averaged results are presented in Table 9. For the literature solubility data set, the averaged values for crossvalidation were r 2 (cv) = 0. 29

■ DISCUSSION
The observation that QSPR models for the prediction of solubility do not improve when trained and tested upon experimental data that are obtained under standardized conditions is surprising and is in disagreement with the assumptions made by other authors. Before this conclusion can be accepted, it is necessary to discuss potential sources of error in this experiment. First, the work is only valid if the experimental data are more accurate than the literature data set. The average standard errors for the measured solubility data are typically <0.05 log solubility units, which is calculated as the standard deviation of multiple independent measurements made using the CheqSol method. The standard error is an indication of the reproducibility of the solubility measurements and does not preclude a systematic error. It is difficult to guarantee the accuracy of the data because there is no separate "goldstandard" set of solubility data with which to compare the results. However, the CheqSol results have proven to be consistent between different laboratories and in tests with carefully measured shake-flask results. Since QSPRs are empirical models, the presence of a systematic (rather than random) error would not necessarily imply that an accurate structure−property relationship could not be derived.
Second, the results may have been deleteriously affected when the random partitioning into training and test sets meant that the models had to extrapolate in logS. In Tables 5 and 8, the errors for the extrapolative predictions of the test set (marked with an a, b, or c) are observed to be of the same order as the nonextrapolative predictions; therefore, this possibility may be rejected.
Third, the regression model might not be optimal because the data set is small and diverse. Without measuring additional solubility data this is difficult to investigate, but a simple experiment was carried out to see whether the Random Forest out-of-bag cross-validation results would converge with data set size. Random Forest models were retrained on data sets of size 10,20,30,40,50,60,70, and 80 molecules, taken from the full data set (n = 85) with experimental solubility values. The selection of each data set was made at random, and as before, the results were averaged over 20 different random selections for each size of data set. A plot of average RMSE(oob) in logS units against the size of the data set is shown in Figure 2. The predictive accuracy is worst when the data set size is very small, as would be expected based upon the statistical averaging that is inherent to Random Forest and because of the general problems with working with small data sets. The average RMSE(oob) in logS units converges at around 60 molecules, which suggests that the size of the data set used in this work is acceptable. Furthermore, when leave-one-out cross-validation was carried out against the complete data set (n = 85) the results were similar for both the new experimental solubility data (q 2 = 0.51, RMSE = 0.86 logS units, and bias = 0.02 logS units) and the literature solubility data (q 2 = 0.49, RMSE = 0.82 logS units, and bias = 0.00 logS units), which suggest that the size of the data set is acceptable. It should be noted that the results illustrated in Figure 2 are dependent upon the diversity of the molecules that are selected (and the physical property of interest). Therefore, these results should not be interpreted as indicating that 60 molecules is an acceptable or minimum data set size for all QSPRs.
Fourth, QSPR models make a prediction of solubility from molecular structure (without knowledge of the crystal packing); therefore, it is reasonable to expect that the error might be due to the difference in solubility between polymorphs. However, the average difference in solubility between polymorphs has been measured to be 2-fold, 56 whereas the error in models to predict solubility is approximately 5-to 10-fold. Therefore, it would seem unlikely that this is the sole reason why the accuracy of QSPR models does not improve when trained upon accurate experimental data.
Our conclusion from this work is that experimental error in literature solubility data is not the limiting factor in predicting aqueous solubility. The predictive errors are similar for models constructed from both new solubility data and data extracted from the literature, even though the latter are known to contain experimental errors of the order of 0.6 to 0.7 log units. An interesting conclusion that can be drawn from this observation is that QSPR models are adept at modeling noise. This might suggest that caution should be exercised when attributing physical significance to QSPR variables.
The obvious question is, "how can QSPR models for the prediction of solubility be improved?" Various authors have demonstrated that the prediction of solubility of liquids and simple nondruglike organic molecules is possible with RMSEs of approximately 0.3 log solubility units. 4 Therefore, the answer may relate to the added complexity of modeling solubility for solid drug molecules. The solubility of a solid drug molecule depends upon the energy required for removing molecules from the crystal lattice as well as the energy gained by solvation. Including these solid-state effects into models to predict solubility is unlikely to be completely successful based solely on single molecule properties because solid−solid interactions also depend upon the geometrical arrangement of molecules in the crystal lattice. The prediction of melting point for drug molecules exemplifies this problem, where the average RMSE for prediction is 40−50°C, 18,57 even though the average experimental error is probably less than 5°C. The most complete solution to this problem might involve the ab initio prediction of crystal structure, but despite recent advances, this remains a major challenge for drug molecules. 58 However, we note that it might not be necessary to know the correct polymorphic form, but only a plausible low energy polymorph, because average differences in solubility between polymorphs (2-fold) are considerably lower than average errors in QSPR models to predict solubility (5-to 10-fold). 59 This suggests that it might be possible to improve existing QSPR models by incorporating lattice energy terms calculated from a simulated or best-guess crystal form. Furthermore, separating crystal packing energy from solvation energy in QSPR models might permit a simpler linear regression model to be developed, which might alleviate the problems noted by Lipinski that solubility becomes more difficult to predict for complex molecules because it depends upon multiple intercorrelated factors. 60 In the long term, a more satisfactory solution to this problem might be to calculate solubility from first-principles using molecular simulation. Although there have been some significant recent advances in this field, 20,21,59,61 further development work is required before these methods become widely used in pharmaceutical research and development. 62,63 Another consideration is that errors in the calculation of logP may contribute to errors in models to predict solubility. However, this conclusion is not supported by the results. First, it is observed that the error in the multilinear regression model is large despite being derived from experimental values of logP and melting point (this observation also implies that melting point and logP are not ideal descriptors to quantify the free energy changes associated with breaking cohesive interactions in the crystal and hydrating the free solute molecules; such contributions to the free energy of solution can be expressed more rigorously as sublimation and hydration free energies by a thermodynamic cycle via the gas-phase 21,59 ). Second, when the Random Forest model was retrained using experimental logP values, rather than calculated logP values, the results did not improve.

■ CONCLUSIONS
We conclude from this work that existing QSPR methods for modeling solubility data do not improve when trained upon experimental data that is obtained under standardized conditions. Furthermore, QSPR models are adept at modeling noise, which suggests that caution should be exercised when attributing physical significance to QSPR variables. The results suggest that further work is required to develop novel QSPR methodologies that are more accurate and more reliable.