Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity

Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.


1) Distribution of measured log P values
Figure S1 Distribution of measured experimental values for the in-house and public datasets. When no filter on qualifiers ("<"/">") is applied (left plot), the peak at 4.7 is clearly visible. Applying this filter (middle plot) removes this artefact of the dataset. Approximately 12% of compounds were filtered out due to qualifiers on the measurement.
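For illustration, a minimal pandas sketch of this qualifier filter; the column names, qualifier symbols, and example rows are hypothetical placeholders, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical measurement table: "logp" is the reported value, "qualifier" marks
# censored measurements ("<" below, ">" above the quantifiable range).
df = pd.DataFrame({
    "smiles":    ["CCO", "c1ccccc1", "CCCCCCCCCCCC"],
    "logp":      [-0.31, 2.13, 4.7],
    "qualifier": ["=", "=", ">"],
})

# Keep only exact measurements; qualified values (e.g. ">" entries piling up
# at the assay limit of 4.7) are removed.
exact = df[~df["qualifier"].isin(["<", ">"])]
print(f"Removed {len(df) - len(exact)} of {len(df)} compounds with qualifiers.")
```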

2) Dataset splitting
For the random splitting, molecules were chosen randomly and assigned to the respective dataset splits. The scaffold-based split was obtained by grouping molecules by their Bemis-Murcko scaffolds 1 based on code from Chemprop 2,3 ; clusters of molecules sharing the same Bemis-Murcko scaffold were randomly assigned to the respective dataset splits. Statistics for the individual subsets of the public dataset are shown in Table S1. For the time-based split, the newest 20% of molecules (as determined by internal registration date) were assigned to the hold-out test set. A five-fold cross-validation scheme on a rolling basis for time series data 4 was used for the hyperparameter optimization, training on increasingly longer time periods while always validating on the subsequent (later registered) 20% of the cross-validation set. In this way, models are only trained on earlier data and used to predict later data.
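As an illustration of the scaffold-based split described above, a minimal RDKit sketch is given below; the actual split used code from Chemprop, and the function name `scaffold_split` as well as the greedy cluster assignment are simplifications for illustration.

```python
import random
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, frac_train=0.8, seed=0):
    """Group molecules by Bemis-Murcko scaffold and assign whole clusters to splits."""
    clusters = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        clusters[scaffold].append(idx)

    # Shuffle scaffold clusters and fill the training set greedily; whatever does
    # not fit within the training fraction goes to the test set.
    cluster_list = list(clusters.values())
    random.Random(seed).shuffle(cluster_list)

    train, test = [], []
    for cluster in cluster_list:
        if len(train) + len(cluster) <= frac_train * len(smiles_list):
            train.extend(cluster)
        else:
            test.extend(cluster)
    return train, test
```

Because entire scaffold clusters are assigned to one split, test molecules never share a Bemis-Murcko scaffold with training molecules.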

3) Hyperparameter optimization
Cutoff distances considered for the 3D graph construction in the Chemprop3D models: [2, 4, 6, 9999] (a cutoff of 9999 results in a fully-connected graph with all interatomic distances).

Figure S2
Hyperparameter optimization results for the Chemprop3D model (conformer choice based on lowest GFN2-xTB energy in water). Public dataset, random split. Panels show the mean absolute error on the validation set during five-fold cross-validation, plotted against the hidden size of the message-passing layers. The cutoff distance for the 3D graph construction is held constant within each panel and increases between panels from left to right. Shaded regions show the standard deviation of the error across the five folds.

Figure S3
Hyperparameter optimization results for the Chemprop3D model (conformer choice based on lowest GFN2-xTB energy in water). Public dataset, scaffold split. Panels show the mean absolute error on the validation set during five-fold cross-validation, plotted against the hidden size of the message-passing layers. The cutoff distance for the 3D graph construction is held constant within each panel and increases between panels from left to right. Shaded regions show the standard deviation of the error across the five folds.

4) Impact of conformer choice on Chemprop3D model performance
Since the cutoff distance for the 3D graph construction was treated as an optimization hyperparameter for the Chemprop3D model, the selected value gives some insight into whether this architecture can benefit from 3D information for this learning task. Chemprop3D models were also trained using the lowest GFN2-xTB 5 energy in wet octanol or the lowest DFT energy in water as the conformer selection criterion. For these models, the best-performing hyperparameter setup found for conformers with the lowest GFN2-xTB energy in water was reused.
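To make the role of the cutoff hyperparameter concrete, the following is a minimal numpy sketch of how a distance-cutoff edge list for a 3D molecular graph can be built from conformer coordinates; this is an illustrative reconstruction under our own assumptions, not the Chemprop3D implementation.

```python
import numpy as np


def distance_edges(coords, cutoff=4.0):
    """Return (i, j, d_ij) for all atom pairs closer than `cutoff` (in Angstrom).

    `coords` is an (n_atoms, 3) array of conformer coordinates. A very large
    cutoff (e.g. 9999) connects every atom pair, i.e. gives a fully-connected
    graph with all interatomic distances.
    """
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)  # exclude self-loops
    i_idx, j_idx = np.where(mask)
    return list(zip(i_idx.tolist(), j_idx.tolist(), dist[i_idx, j_idx].tolist()))
```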

Figure S4
Test set errors for the public dataset with different splitting strategies. Error bars show 95% confidence intervals. 6 All models use hyperparameters obtained from the cross-validation hyperparameter screening with the lowest GFN2-xTB energy in solvent as the criterion for conformer choice.

Figure S5
Predicted-vs-calculated plots for test set molecules. In-house dataset, random split.

Figure S6
Predicted-vs-calculated plots for test set molecules. In-house dataset, scaffold split.

Figure S7
Predicted-vs-calculated plots for test set molecules. In-house dataset, time split.

Figure S8
Predicted-vs-calculated plots for test set molecules. Public dataset, random split.

8) log P prediction for peptides
SMILES for the set of LIPOPEP compounds were extracted from the ESI of the reference publication. 7 While the publication reports RMSE separately for the cross-validation and test set ("external validation"), this split is not provided in the ESI, and we therefore test our models (trained on the in-house data) on the entire LIPOPEP dataset after the filtering steps described below; note that no re-training of our models was performed. We exclude compounds which the authors denote as "ionizable" and further remove charged or zwitterionic compounds as described in the "Methods" section (illustrated in the sketch below). No further SMILES preprocessing (tautomer selection etc.) was performed for this analysis. Model predictions are made using the Chemprop 2,3 model trained on in-house data using a scaffold-based split. Table S6 and Figure S15 show the prediction results.

Figure S15 Predicted-vs-calculated plots for peptide molecules. Chemprop models trained on in-house data using a scaffold-based split.
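For illustration, a minimal RDKit sketch of the charge/zwitterion filter mentioned above; the function name and example SMILES are hypothetical, and the exclusion of compounds annotated as "ionizable" in the reference publication is a separate step not shown here.

```python
from rdkit import Chem


def is_neutral_non_zwitterion(smiles):
    """Return True if no atom carries a formal charge.

    Zwitterions (net charge zero but with +/- centres) are also rejected,
    because any non-zero atomic formal charge triggers the filter.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return all(atom.GetFormalCharge() == 0 for atom in mol.GetAtoms())


print(is_neutral_non_zwitterion("C(C(=O)[O-])[NH3+]"))  # False: glycine zwitterion
print(is_neutral_non_zwitterion("NCC(=O)O"))             # True: neutral glycine
```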

9) Experimental uncertainty for in-house dataset
Figure S16 Histogram of absolute differences between pairs of compounds with multiple experimental log P measurements for the in-house dataset. In the case of >2 measurements, all pairwise combinations are considered (1518 pairs in total). The average of all pairwise absolute differences is 0.18 log units.

Figure S21 Distribution of the mean Tanimoto similarity (radius-2 Morgan fingerprints, count-based, 2048 bits) between compounds of the public (left) and in-house (right) datasets and their 5 nearest neighbors within the respective datasets. The public dataset shows a mean similarity of 0.5666 ± 0.1636, the in-house dataset 0.7069 ± 0.1196 (mean ± 1 std. deviation).
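For illustration, a minimal RDKit sketch of the nearest-neighbor similarity analysis underlying Figure S21 (count-based Morgan fingerprints, radius 2, 2048 bits, 5 nearest neighbors); the SMILES list is a hypothetical placeholder and the choice of fingerprint-generator API is our assumption.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.DataStructs import BulkTanimotoSimilarity

# Hypothetical stand-in for one of the datasets.
smiles = ["CCO", "CCN", "CCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCC(=O)O"]

# Count-based Morgan fingerprints, radius 2, 2048 bits.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [generator.GetCountFingerprint(Chem.MolFromSmiles(s)) for s in smiles]

# For every compound: mean Tanimoto similarity to its 5 nearest neighbors
# (excluding the compound itself) within the same dataset.
k = 5
mean_nn_sim = []
for i, fp in enumerate(fps):
    sims = BulkTanimotoSimilarity(fp, [f for j, f in enumerate(fps) if j != i])
    mean_nn_sim.append(np.mean(sorted(sims, reverse=True)[:k]))

print(f"mean ± std over dataset: {np.mean(mean_nn_sim):.4f} ± {np.std(mean_nn_sim):.4f}")
```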

15) Duplicates between datasets & label consistency
There are 29 duplicate compounds (same canonical SMILES) between the public and in-house datasets, which were analyzed for consistency of their labels, both for experimental and ReSCoSS-calculated log P values (Figure S23). For experimentally determined values, most compounds show only small deviations between the two datasets. We attribute these to experimental uncertainty and differences in the measurement protocol and equipment used in the collection of the two datasets. The few larger deviations could be caused by experimental errors. In the case of ReSCoSS-calculated values, differences in the labels are at the level of numerical accuracy (mean difference 0.0022).
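A minimal sketch of the duplicate check by canonical SMILES; the two example lists are hypothetical placeholders standing in for the public and in-house datasets.

```python
from rdkit import Chem


def canonical(smiles):
    """Canonicalize a SMILES string with RDKit (None if it fails to parse)."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


public_smiles = ["c1ccccc1O", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]
in_house_smiles = ["Oc1ccccc1", "CCN", "CC(=O)Oc1ccccc1C(=O)O"]

public_set = {canonical(s) for s in public_smiles} - {None}
in_house_set = {canonical(s) for s in in_house_smiles} - {None}

duplicates = public_set & in_house_set
print(f"{len(duplicates)} shared compounds: {sorted(duplicates)}")
# For these shared compounds, experimental and ReSCoSS-calculated labels from both
# datasets can then be compared, as in Figure S23.
```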

Figure S23
Distribution of the differences in labels between the in-house and public datasets, for experimental (left) and ReSCoSS-calculated (right) log P values.