Predicting drug properties with parameter-free machine learning: pareto-optimal embedded modeling (POEM)

The prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET) of small molecules from their molecular structure is a central problem in medicinal chemistry with great practical importance in drug discovery. Creating predictive models conventionally requires substantial trial-and-error in selecting molecular representations and machine learning (ML) algorithms and in tuning hyperparameters. A generally applicable method that performs well on all datasets without tuning would be of great value but is currently lacking. Here, we describe Pareto-optimal embedded modeling (POEM), a similarity-based method for predicting molecular properties. POEM is a non-parametric, supervised ML algorithm developed to generate reliable predictive models without the need for optimization. POEM’s predictive strength is obtained by combining multiple different representations of molecular structures in a context-specific manner, while maintaining low dimensionality. We benchmark POEM relative to industry-standard ML algorithms and published results across 17 classification tasks. POEM performs well in all cases and reduces the risk of overfitting.

Molecular representations are a key component for predictive modeling of chemical activity. Different representations can be more or less relevant to different predictive tasks, not unlike the context-dependent use of different molecular representations in a laboratory. For example, acetylsalicylic acid can be described by its common names, IUPAC name, chemical formula (C9H8O4), a simplified molecular-input line-entry system (SMILES) string (e.g. "O=C(C)Oc1ccccc1C(=O)O") (6, 7), or a drawing of a 2-D structure (e.g. as in Figure 1). Each of these representations is valid, but each describes acetylsalicylic acid with a different level of information content. For example, Acetozone has the same chemical formula as acetylsalicylic acid, but not the same 2-D structure; this is consistent with the observation that the chemical formula is a higher entropy description of a molecule than a drawing of its 2-D structure. For most ML applications, the molecular representation is defined in a way that is much simpler for a computer to extract useful information from, such as a binary or numeric vector (e.g. Figure 1E).
Physicochemical molecular descriptors are one conceptually simple way to create such a vector: by measuring a series of known properties (e.g. mass, number of heteroatoms, charge, etc.) (8). While physicochemical descriptors have favorable interpretability, these overly simplistic representations of a molecule can lead to poor predictive power or a host of other problems associated with incorrect descriptor selections (9). Thus, an approach relying on physicochemical molecular descriptors may be robust for a single predictive task, but often will not be generalizable.
Alternatively, molecules may be converted into a vector format in a process called molecular fingerprinting ( 10 ). Unlike physicochemical molecule descriptors, fingerprints do not need to have any human-obvious relationship to the properties of the molecule; they are generated by following strict algorithmic rules that associate individual positions in the vector to the presence/absence of specific substructures and substructure relationships. A primary advantage gained by using fingerprints, as opposed to purely physicochemical molecular descriptors, is that they are usually much more generalizable; a fingerprint of a molecule will often be useful to some degree across many different problems. Unfortunately, this generality comes with a lack of specificity, and risks training a model on syntactic features of the representation ("noise") rather than true details of the structure ( 11 ).
Most fingerprinting methods lose information upon encoding, resulting in vectors that can each represent several different molecules (12, 13). Typically, the loss of information is coupled to specific advantages such as improved speed, lower memory usage, algorithmic advantages, or increased density of useful information. For instance, Daylight fingerprints can exist as long, variable length fingerprints, or they can be 'folded' into a fixed-length representation, gaining speed and memory efficiency at the cost of reversibility (10). In addition to losing information, compressed (folded) fingerprints are prone to introducing algorithmic noise, leading to potential false equivalencies. While reversibility may be desirable in some applications, it is not always necessary.
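The false equivalencies introduced by folding can be illustrated with a minimal sketch. It assumes a simple modulo-based folding of sparse bit indices; the function name, the bit indices, and the folded length of 1024 are all hypothetical choices for illustration and do not reproduce the Daylight implementation.

```python
def fold_fingerprint(on_bits, folded_length):
    """Fold a sparse fingerprint (a set of on-bit indices) to a fixed length.

    Assumption: folding maps each bit index to index % folded_length.
    """
    return {bit % folded_length for bit in on_bits}


# Two hypothetical molecules with different unfolded fingerprints.
mol_a = {3, 1030, 2051}
mol_b = {3, 6}

folded_a = fold_fingerprint(mol_a, 1024)
folded_b = fold_fingerprint(mol_b, 1024)

# Distinct bits 3 and 2051 collide after folding, so mol_a loses one bit
# and the two folded fingerprints can no longer be told apart.
print(sorted(folded_a), sorted(folded_b))
```

In this toy case the folded vector cannot be reversed to recover which of the two original fingerprints produced it, which is the trade-off described above.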
The relationship between molecular features and chemical activity is established during the training stages of supervised learning algorithms. Chemical activities are predicted using industry-standard ML approaches for supervised learning, such as support vector machines (SVMs), random forest modeling, ridge classification and neural networks. Typically, each algorithm has its own specific set of hyperparameters that influence the performance of the trained models. The problem of knowing which hyperparameters to use is itself an interesting and non-trivial topic of research (14). In practice, the optimal combination of hyperparameter values also varies across tasks and must be calibrated through a trial-and-error based optimization process. The resulting high computational cost of retraining models can be a deterrent for incorporating new training data at a later stage, although this drawback is partially mitigated in some neural network based approaches through recent advances in transfer learning (15, 16). When a supervised learning algorithm successfully establishes relationships between molecular features and chemical activities, the resulting model can be used to predict the chemical activity of new, previously unseen molecules.
Model interpretability is another important factor in the selection of supervised learning algorithms. Some algorithms, for example decision trees, provide human interpretable justifications for the associations between molecular features and chemical activities. For other algorithms, such as neural networks, the reasons for any given prediction can often be extremely hard to determine. In practice, the "black box" nature of these models can inhibit trust and possibly reduce the real-world utility of the model. Understanding this rationale can lead to new strategies for reducing bias, overfitting, and possibly even improve theory ( 17 ). Ideally, each prediction generated by the model has a clear rationale behind it, which can be understood by the researcher using the model.
Unfortunately, the combination of variable context-dependent molecular representations, choice of supervised learning algorithms and hyperparameterization is typically inconsistent between predictive tasks. Any specific combination of these three factors optimized for a specific chemical activity model will not be applicable to predicting other activities. In practice, prediction of seemingly related chemical properties may end up requiring models with different molecular features, algorithms, and/or parameters following substantial development effort by expert chemoinformaticians.
One approach to addressing inconsistency in the choice of molecular fingerprints and tentatively boosting performance is to use multiple fingerprints simultaneously as features. It is clear that distinct molecular representations possess varying amounts of useful information relating to specific problems ( i.e. each molecular representation discards information, but not every method discards the same information). So, two fingerprints of the same molecule produced by two different low-information fingerprint methods will share some redundant information about the structure, and contain some unique information. By combining molecular representations, a greater amount of "true" information can be captured from the complete structure of the molecule. A naive way to combine molecular representations is to append the vectors generated by each fingerprint directly onto each other, to create a new, longer fingerprint, of multiple types. This approach has at least two major disadvantages: [1] for some supervised learning algorithms, the computational costs may become prohibitive when concatenating multiple fingerprints, due to increased length of the vector; and [2], the concatenation approach is prone to the creatively named "curse of dimensionality" ( 18 -20 ). This phenomenon occurs when the ratio of training data to features is low, and overfitting becomes not just possible, but likely. Other methods approach this issue by using voting schemes to try to weigh consensus between models built using different fingerprints ( 21 ), or by nesting the selection of a fingerprint with hyperparameter selection to empirically determine the most effective fingerprint ( 22 ).
In this report, we describe Pareto-Optimal Embedded Modeling (POEM), a novel supervised learning method that reliably creates accurate models for predicting drug activity based on multiple representations of molecular structure. POEM evades the curse of dimensionality by using Pareto multi-front optimization (23, 24) to massively shrink the number of dimensions that define molecular similarity, in a context-dependent manner. Pareto optimization is a powerful general approach for identifying optimal solutions in cases where there is more than one metric to optimize and the metrics may not always agree with each other. It has found broad applications including protein structure minimization (25, 26) and lead optimization in drug design (27). Effectively, the context-dependent dimensionality reduction introduced by POEM enables the use of multiple fingerprints to describe every molecule in the reference molecule library, without introducing a risk of overfitting. The specific use of a Pareto-based approach to defining similarity ensures that all comparisons remain 'like-to-like' and avoids the need for heuristic transformations and weighting schemas. POEM also has a number of additional functional advantages: the rationale for each prediction is interpretable, the algorithm has no hyperparameters, and models can be easily updated with new labeled reference molecules. This approach was designed for the rapid generation of multiple predictive models, without the need for expert intervention. We demonstrate the generalizability and consistency of POEM across a broad range of predictive tasks by modeling 17 ADMET properties of interest to pharmaceutical drug development.

Strategy
POEM uses a Pareto dominance definition to combine multiple definitions of molecular similarity, into a robust metric suitable for supervised learning tasks. The POEM method has four major steps: [1] fingerprinting, [2] embedding the known molecules into a limited context based on similarity, [3] calculating Pareto dominance relationships for the known molecules, and [4] converting these dominance relationships into similarity scores and predictions. A high-level flowchart describing this process and the outputs at each step is provided in Figure S1.
Step 1: Fingerprinting
N molecular fingerprinting techniques are applied to the target molecule and to the reference library of M labeled compounds.
Step 1 results in a matrix of M x N fingerprints for the reference library and a vector of N fingerprints for the target molecule. For this study, ten diverse and widely-used fingerprints were chosen. Detailed parameters and references for these ten fingerprints are provided in Table S1.

Step 2: Embedding Known Molecules
The fingerprint representations of the target molecule are embedded onto the chemical landscape of the reference library via a non-reversible transformation. For each fingerprint, Tanimoto distances (28) (Figure 1E) are calculated between the target molecule and all reference molecules in the library.
Step 2 results in a matrix of M x N distance values, centered on the target molecule.
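The embedding step can be sketched as follows, assuming each fingerprint is stored as a set of on-bit indices and that Tanimoto distance is taken as one minus the Tanimoto (Jaccard) coefficient. The function names and the toy fingerprints are illustrative assumptions, not the implementation used in this study.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints are identical
    return 1.0 - len(bits_a & bits_b) / union


def embed(target_fps, library_fps):
    """Build the M x N distance matrix centered on the target molecule.

    target_fps:  list of N fingerprints for the target
    library_fps: list of M lists, each holding N fingerprints
    """
    return [
        [tanimoto_distance(t, r) for t, r in zip(target_fps, ref_fps)]
        for ref_fps in library_fps
    ]


# Toy example with N = 2 fingerprints and M = 3 reference molecules.
target = [{1, 2, 3}, {10, 11}]
library = [
    [{1, 2, 3}, {10, 12}],
    [{4, 5}, {10, 11}],
    [{1, 2}, {20}],
]
distances = embed(target, library)
for row in distances:
    print([round(d, 3) for d in row])
```

Each row describes one reference molecule only in terms of its distances to the target, so the original fingerprints cannot be recovered from the matrix.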

Step 3: Calculating Dominance Relationships
Although each molecule in the reference library is represented by a set of N distances, the specific values of the distances are not directly comparable. A distance of 0.4 may represent a significant match for one fingerprint, but random noise for another. To resolve this issue, Pareto dominance relationships are used to establish an overall distance. Here, the target molecule is selected as the ideal objective for multi-front optimization and the set of fingerprints represents an N-dimensional space (27, 29). Dominance relationships between molecules from the reference library are defined on the basis of which molecule is closer to the target. One reference library molecule may be closer than another to the target for all N distances, or for only a subset of the N dimensions. When evaluating the dominance of one reference library molecule (A) over another labeled molecule (B) across all 10 distances, closer distances are assigned a value of 1, ties are assigned a value of 0.5, and further distances are assigned a value of 0. A comparison vector AB = [1, 0, 1, 0.5, 0, 0.5, 0.5, 1, 1, 1] would indicate that molecule A is more similar to the target molecule in five fingerprint representations, tied in three representations, and more dissimilar in two fingerprint representations.
Step 3 results in an M x M symmetric matrix of dominance relationships.
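The per-fingerprint scoring described above can be sketched as a short function. The distance values below are hypothetical and were chosen only so that the result reproduces the example comparison vector AB; exact equality is assumed to define a tie.

```python
def comparison_vector(dist_a, dist_b):
    """Score molecule A against molecule B for each fingerprint.

    1.0 if A is closer to the target, 0.5 for a tie, 0.0 if A is further.
    """
    scores = []
    for a, b in zip(dist_a, dist_b):
        if a < b:
            scores.append(1.0)
        elif a == b:
            scores.append(0.5)
        else:
            scores.append(0.0)
    return scores


# Hypothetical distances to the target across 10 fingerprints.
dist_a = [0.2, 0.9, 0.1, 0.5, 0.8, 0.3, 0.4, 0.2, 0.1, 0.3]
dist_b = [0.4, 0.5, 0.6, 0.5, 0.2, 0.3, 0.4, 0.7, 0.9, 0.6]

ab = comparison_vector(dist_a, dist_b)
ba = comparison_vector(dist_b, dist_a)
print(ab)
```

Under this scoring the reverse comparison BA is the element-wise complement of AB, so only one direction needs to be computed for each pair of reference molecules.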
In a naive Pareto scheme, a molecule would dominate another molecule if all its distances were as close or closer to the target molecule, and at least one distance was closer (26). POEM relaxes the naive definition of dominance, allowing a molecule to claim dominance over another even if up to 10% of its distance comparisons remain further from the target. In practice, this relaxation yields more dominance relationships overall when a larger number of fingerprints is used. The added dominance relationships reduce the likelihood of ties, which helps establish a complete ranking of all molecules in the labeled library relative to the target. Sample code for evaluating these relationships is provided in Supplemental Pseudocode 1.
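The difference between the naive and relaxed definitions can be sketched as below. This is not Supplemental Pseudocode 1. The 10% tolerance is taken from the text, while the function names and the extra requirement that wins outnumber losses (added here so that two molecules cannot dominate each other) are assumptions made for illustration.

```python
MAX_LOSING_FRACTION = 0.10  # tolerance stated in the text


def naive_dominates(comparison):
    """A dominates B if it is never further and is closer at least once."""
    return 0.0 not in comparison and 1.0 in comparison


def relaxed_dominates(comparison):
    """A dominates B even if a small fraction of comparisons are losses.

    Assumption: wins must also outnumber losses to avoid mutual dominance.
    """
    wins = comparison.count(1.0)
    losses = comparison.count(0.0)
    allowed_losses = MAX_LOSING_FRACTION * len(comparison)
    return wins > losses and losses <= allowed_losses


one_loss = [1, 1, 1, 0.5, 1, 1, 0, 1, 1, 1]
two_losses = [1, 0, 1, 0.5, 0, 0.5, 0.5, 1, 1, 1]

print(naive_dominates(one_loss), relaxed_dominates(one_loss))
print(naive_dominates(two_losses), relaxed_dominates(two_losses))
```

With ten fingerprints the relaxation tolerates a single losing comparison, so the first vector yields a dominance relationship only under the relaxed rule, and the second vector yields none under either rule.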

Step 4: Calculating Fitness Scores and Final Prediction
Labeled reference molecules are ranked by their similarity to the target molecule by converting dominance relationships to a single-value fitness score . For a given molecule, its fitness is defined as: This schema favors labeled reference molecules which compare favorably "on average" for all fingerprints ( MeanDominance ), which dominate many other molecules ( NumDominating ), and which are not being dominated by others ( NumSubmitting ). This approach favors labeled molecules that are "best-of-class" across all metrics of similarity to the target molecule. Sample code for ranking molecules is provided in Supplemental Pseudocode 2.
Labeled reference molecules are ranked according to their fitness scores and summed to provide a "total fitness" value. In practice, fitness values vary by orders of magnitude between the most similar and dissimilar molecules. In some cases, the top few molecules could contribute the vast majority of the weight towards the summed fitness value. Alternatively, contribution towards the total fitness value may be more broadly distributed across the reference molecule library. The relative contribution of each labeled reference molecule to the total fitness score is then used as a weight towards each observed class label. Weighted averages are treated as probabilities of the target having any given label. Sample code evaluating probabilities is provided in Supplemental Pseudocode 3.
Step 4 results in a Length M fitness vector of similar molecules, which is used to assign probabilities to each label class.
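The conversion from fitness scores to label probabilities can be sketched as follows. This is not Supplemental Pseudocode 3. The fitness values are taken as given inputs, and the function name, labels, and numbers are hypothetical.

```python
from collections import defaultdict


def label_probabilities(fitness, labels):
    """Weight each reference label by its share of the total fitness."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    probabilities = defaultdict(float)
    for score, label in zip(fitness, labels):
        probabilities[label] += score / total
    return dict(probabilities)


# Hypothetical fitness vector in which one molecule carries most of the weight.
fitness = [8.0, 1.0, 0.5, 0.5]
labels = ["active", "inactive", "active", "inactive"]

probs = label_probabilities(fitness, labels)
print({label: round(p, 3) for label, p in probs.items()})
```

Because every reference molecule contributes in proportion to its fitness, the same code covers both the case where a few top molecules carry most of the weight and the case where the weight is spread broadly across the library.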

Benchmarking POEM to Standard Approaches with 17 ADMET property predictions
POEM was benchmarked relative to five standard supervised learning algorithms, across 17 predictive tasks related to drug ADMET properties. All 17 ADMET datasets were taken from public sources and range between 522 and 6505 labeled reference molecules. Each molecule was represented by a SMILES string. RDKit release 2018.09.1 was used to parse, canonicalize, and featurize small molecules. Molecules that could not be automatically processed by RDKit and molecules that have both positive and negative data labels were excluded from each dataset. Redundant data points representing experimental replicates were also removed. The Supplementary Text contains references and descriptions for each dataset used in this study, including the total number of training examples for each label after dataset cleaning. Python's scikit-learn package v0.19.1 ( 30 ) was used to build models for each of the five standard supervised learning algorithms: Gradient Boosting Classifier, Random Forest, Ridge Classifier, Stochastic Gradient Descent Classifier, and Support Vector Machine, representing a range of industry-standard supervised classifier types, which are suitable for the dataset sizes in this benchmark study. Each of the five standard supervised learning algorithms was trained using a grid search strategy for hyperparameter optimization, with an added nested layer evaluating performance separately for each of the fingerprints listed in Table S1. In contrast, POEM has no hyperparameters and considers all fingerprints simultaneously. In addition, a Molecule Graph Convolution (31, 32) model was trained using the GraphConv model in the DeepChem python package (33), on all 17 datasets, to provide a comparison to state-of-the-art deep neural network methods. 
Results are reported both from a naive model with default parameters (2 convolution layers of size 64, one dense layer of 128, 75 features per atom, 10 training epochs), and a hyperparameter optimized model (parameters selected after 20 rounds of 5-fold cross-validated Bayesian optimization, for reference see Table S4). Finally, we also evaluated POEM models that were limited to using only individual fingerprints, to provide an added comparison to a non-consensus, similarity-based classifier approach.

Cross validation of predictive models
For each of the five standard supervised learning approaches, models were trained and evaluated with a five-fold cross-validation strategy, using 80% of the dataset for training and hyperparameter optimization and withholding 20% for blind performance evaluation. The same 80% / 20% split was used to create a POEM test set , which provides a direct comparison to the five standard algorithms. POEM is also amenable to full 'leave-one-out' cross validation due to the lack of a computationally-expensive training process. This corresponding POEM Full Set evaluation provides an indication of predictive robustness with respect to dataset size. Additionally, nested cluster validation was performed as in Mayr et al. (34), with the added restriction that the test set must contain at least one example of each class.
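Leave-one-out evaluation of an instance-based method can be sketched as a simple loop, since no model has to be retrained between predictions. The `predict` callable and the one-dimensional nearest-neighbor stand-in below are hypothetical placeholders and do not implement POEM itself.

```python
def leave_one_out(library, labels, predict):
    """Predict each molecule from all other molecules in the library.

    predict(target, refs, ref_labels) must return a score for the positive class.
    """
    scores = []
    for i, target in enumerate(library):
        refs = library[:i] + library[i + 1:]
        ref_labels = labels[:i] + labels[i + 1:]
        scores.append(predict(target, refs, ref_labels))
    return scores


def nearest_neighbor_predict(target, refs, ref_labels):
    """Stand-in predictor: return the label of the closest reference value."""
    closest = min(range(len(refs)), key=lambda k: abs(refs[k] - target))
    return float(ref_labels[closest])


# Toy library in which each "molecule" is a single number.
library = [0.1, 0.2, 0.9, 1.0]
labels = [0, 0, 1, 1]

scores = leave_one_out(library, labels, nearest_neighbor_predict)
print(scores)
```

The held-out molecule is simply removed from the reference library for its own prediction, which is why this form of validation is inexpensive for methods without a training stage.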

Results:
POEM outperforms five standard supervised learning methods for predicting 17 ADMET properties
Table 1 provides a comparison of POEM performance with the top performing fingerprint/algorithm combinations generated from standard supervised learning approaches, across 17 different ADMET tasks. Across all tasks, the POEM Test Set outperforms the standard supervised learning algorithms, as determined by ROC area under the curve (AUC) score, in some cases by more than 10% (Figure 2). Additionally, POEM performs approximately as well as, or better than, the GraphConv neural network (Figure 2). Validation scores associated with standard classifiers are consistently better than test scores (Figure 3), indicating some degree of overfitting. In contrast, the POEM Full Set scores are sometimes higher and sometimes lower than the POEM Test Set scores. Generally, similarity between these scores is an indication of predictive robustness and resistance to overfitting. A notable exception in both cases is the AR dataset, which shows the largest score disparity between the POEM Full Set and the POEM Test Set. This disparity and relatively low predictive performance may indicate an insufficient representation of chemical space in the underlying dataset.
To assess predictive robustness with regards to the random test set selection, 100 different 80% / 20% testing splits were performed on the Blood Brain Barrier (BBB) and the Caco-2 Permeability (Caco2) datasets ( Figure S2). ROC AUC was computed for all 100 using both POEM and the previously determined best model and hyperparameters for each approach (as reported in Table S2). Compared to the standard classifiers, the POEM performance is higher for each test set, and overall shows less variability.
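For reference, the ROC AUC metric used in these comparisons equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are hypothetical, and this pairwise form is shown for clarity, not as the routine used in this study.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for five molecules.
y_true = [1, 1, 0, 0, 1]
y_score = [0.9, 0.4, 0.5, 0.1, 0.8]

auc = roc_auc(y_true, y_score)
print(round(auc, 4))
```

Here five of the six positive/negative pairs are ordered correctly. The error raised for a single-class input reflects the fact that the metric is undefined unless both classes are present.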
To assess POEM's ability to generalize, a nested cluster validation approach was used. This approach provides insight into how well POEM can make predictions for molecules highly dissimilar to any known reference molecules. We see that POEM generalizes well for certain datasets (e.g. BBB in Figure S3), though in other cases performance is poor on dissimilar molecules (e.g. AR in Figure S3). This approach was also applied to the GraphConv neural network, with comparably good results (Figure S4).
We also compared POEM performance with literature-reported results for models trained on the same datasets. Kansen et al. report expert-optimized models for the Ames Mutagenicity dataset built using molecular descriptors and seven binary classification tools ( 35 ). The top performing model in that study was a support vector machine (SVM) with ROC AUC of 0.86, whereas the non-parametric POEM automatically generated comparable models with 0.87 AUC. This study shares three common datasets with AdmetSAR, a predictive tool that uses substructure-based descriptors and support vector machines ( 36 ). AdmetSAR reports five-fold cross-validation ROC AUC values for: Blood Brain Barrier (0.9517), Human Intestinal Absorption (0.9458), Caco-2 Permeability (0.8216). These models were generated using different combinations of three fingerprints and three binary classification algorithms ( 36 ). The Androgen Receptor (AR) activity dataset was taken from the Tox21 Challenge, a federal collaboration involving NIH, EPA and the FDA aimed to develop better toxicity assessment methods. For this sub-challenge, 31 teams contributed different predictive models, with the leading ROC AUC at 0.828 ( 37 ). Consistently, POEM matched or outperformed expert-developed models reported in the literature.
We also compared the standard POEM approach to a modified variation that uses individual fingerprints rather than a consensus of ten, reporting all leave-one-out cross validation results for each dataset in Table S3. As expected, the standard consensus approach performs reliably well, appearing among the top models for each task and clearly outperforming any individual single-fingerprint model when all tasks are considered together. The use of a multiple-fingerprint consensus also helps establish meaningful confidence scores associated with the predictive tasks. Figure 4 presents the distribution of POEM-predicted probability for the 'correct' label for the Caco-2 permeability dataset, across the consensus 10FP POEM, and each of the ten single-fingerprint models. In this figure, data points that lie below 0.5 represent an incorrect prediction and points closer to 1 (100%) and 0 (0%) represent higher confidence predictions. The Caco-2 dataset was chosen as an example dataset to demonstrate this principle, since the consensus approach ranked lower than three other single-fingerprint models. Performance varied across all fingerprinting methods, and was of approximately average predictive power overall. This figure demonstrates that the consensus model is better able to capture the confidence of a given prediction, as most of the observed incorrect predictions were made with lower confidence than for the single-fingerprint models. The single-fingerprint models almost exclusively produce highly confident correct predictions, or highly confident incorrect predictions.

Discussion:
At its core, POEM is a method based on measuring similarity, conceptually similar to a K-nearest neighbor approach ( 20 ). The predictive power of the method is based on the assumption that molecules with similar structures have similar properties. This approach has an established tradition ( 38 ), and is not in itself novel. POEM differs from these other approaches by intentionally restricting comparisons to relative similarity (through the embedding stage). Information is lost during the transformation of quantitative distances to relative similarity, such that similarity relationships evaluated are only meaningful in the context of the specific target molecule and predictive task. Specifically, the magnitude of any given distance has variable significance across fingerprints and predictive tasks. By ignoring the quantitative distribution of Tanimoto distances and operating only on less/greater comparisons between like fingerprints, POEM's treatment of reference molecule data is statistically non-parametric . The transformation to a 'distribution-free' representation of distances sidesteps the need for fingerprint-related transformation functions, weights or voting schemas. In turn, the entire landscape of labeled reference molecules can be used in making each prediction in a consistent and systematic manner, without the need to introduce fingerprint-related hyper-parameters to optimize from task-to-task. This approach leads to fast and objective model building, two major functional advantages of the algorithm.
While POEM's treatment of input data is statistically non-parametric, modification to the algorithm itself may impact performance. For example, adding new fingerprints may further improve performance. Future studies on larger datasets may identify new such modifications that produce significantly different optimal configurations from task-to-task, leading to optional hyper-parameters and a potential for optimization. Nonetheless, the results in this study demonstrate consistent performance using a static algorithm configuration and fingerprint selection. In this context, POEM is a powerful general purpose supervised machine learning approach that does not require hyperparameter optimization. POEM was designed to reduce the need for highly-tuned models, crafted by ML experts. Conceptually, the POEM algorithm is also applicable to problems outside of chemistry, as long as the object of the prediction has multiple representations which have metrics to define similarity.
We have shown that the increased performance and consistency of POEM is attributed to the use of multiple fingerprints simultaneously, as previously observed in other similarity approaches such as the Similarity Ensemble Approach ( 38 ). POEM models created using 10 fingerprints outperform those created by individual fingerprints with few exceptions, which may be attributed to random variation. This is further supported by the observation that the optimal fingerprint differs on a task-by-task basis when evaluating the standard supervised learning methods (Table 1) or single fingerprint POEM models (Table S3). Even seemingly related tasks, such as Cytochrome P450 activity predictions, are best addressed with different fingerprints for different isoforms. If algorithmic noise is responsible for variation between models generated using different fingerprints or hyperparameters, then model performance may be compromised in subsequent real world applications. The use of multiple fingerprints simultaneously in POEM side-steps this issue, providing reliable performance across a range of predictive tasks. Efforts were initially made to also build predictive models for benchmarking using concatenated fingerprints, but computational runtime was deemed cost-prohibitive, demonstrating instead a distinct speed advantage to the POEM strategy for combining representations.
POEM is also seen to generalize well, in terms of the ability to make predictions for molecules unlike any known reference molecules (Figure S3). More specifically, we observe that it performs as well as or better than the GraphConv neural network (Figure S4). This network in particular was chosen because it is known to be highly performant (31) and was not cost-prohibitive to run, though we were forced to limit nested cluster validation to two properties due to compute cost.
While this study limits the scope of POEM to classification problems, the fundamental relationship between molecular similarity and activity established by POEM may provide a suitable framework for developing regression models. Preliminary findings applying POEM to four standard benchmark datasets presented in Supplemental Figure 5 demonstrate favorable performance over leading deep learning frameworks to develop regression models for chemical activity. Future studies using a broader range of datasets would provide a broader understanding on POEM's utility towards regression problems.
We have identified a number of functional advantages to POEM, including model building speed, reliability, predictive power, objectivity, ease of use, and model interpretability. POEM is particularly well suited to applications requiring automation due to its objective nature and reduced risk of overfitting. Highly automated applications may include: models built upon large-scale data mining expeditions, datasets with frequent updates, or model building by subject-area field experts without first-hand experience developing ML models. The similarity-based nature of this algorithm helps provide model interpretability, as each prediction is coupled with the list of reference molecules, their relative similarity, and their labels.
The above notwithstanding, there are important trade-offs associated with the POEM approach. Mainly, POEM has high algorithmic complexity associated with the generation of an M x M dominance matrix. Due to POEM's instance-based learning nature, predicting each unlabeled molecule scales proportionally with the square of the number of molecules in the reference molecule library. Algorithmically, this prediction stage is significantly slower when compared to other standard supervised learning methods. In practice however, POEM can handle datasets up to 100,000 training examples on modern personal computers. Applying POEM to ligand-based virtual screening using libraries with millions of molecules may also impose a technical challenge, requiring distributed computing solutions. Future heuristic approximations may improve POEM dataset scalability. Nonetheless, the lack of dedicated 'training' and 'optimization' stages make up for this limitation in applications where fewer predictions need to be made. For instance, POEM models can be easily improved over time, without added computational cost, simply by adding new data into the set of labeled reference molecules. Additionally, we have observed that highly unbalanced datasets can behave poorly when using POEM to make predictions, and additional dataset balancing might be desirable for producing highly performant models.
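The quadratic scaling can be made concrete with a rough count of elementary distance comparisons per prediction. The sketch assumes that each unordered pair of reference molecules is compared once across every fingerprint; the function name is hypothetical, and the count ignores the cost of fingerprinting and distance calculation.

```python
def comparisons_per_prediction(num_refs, num_fingerprints):
    """Count pairwise distance comparisons needed for one target molecule.

    Assumption: each unordered pair of reference molecules is compared once
    for every fingerprint.
    """
    pairs = num_refs * (num_refs - 1) // 2
    return pairs * num_fingerprints


# Smallest and largest benchmark datasets, and the stated practical limit.
for size in (522, 6505, 100_000):
    print(size, comparisons_per_prediction(size, 10))
```

Under these assumptions a library of 100,000 molecules requires on the order of 5 x 10^10 comparisons for each target molecule, which illustrates why screening libraries of millions of molecules would call for distributed computing or heuristic approximations.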
As is always the case with machine learning approaches, the main determinant for predictive performance is the nature and quality of the data used for training. This is observed here in the contrast between good models (BBB, ERa and HIA) and bad models (Carcin, CYP), especially when looking at generalizability ( Figure S2). Unfortunately, in the world of drug-property prediction, many of the best datasets are privately held, despite exciting but limited recent efforts to make some data available to researchers and the public ( 39 ).
In spite of the above limitations, we consider POEM to be a valuable addition to the roster of methods for supervised learning available today, especially given its lack of hyperparameters, high generalizability, and low cost.

Shown are the ROC AUC scores for full leave-one-out prediction with POEM (POEM full set, black circles), a 20% test set with POEM (POEM test set, black triangles), and the same 20% test set for best performing fingerprint and hyperparameters for each industry-standard ML method (colored straight lines). In addition, reported scores from AdmetSAR (36) (gray squares), an ADMET prediction tool, other Literature Values (31, 33) (gray ovals), and a naive (gray diamonds) and hyperparameter optimized (gray rhomboids) GraphConv (31) neural network are also shown when applicable.

Figure 3. Comparison of ROC AUC scores on test data versus leave-one-out cross validation scores (for POEM) or pre-validation scores (for industry-standard classifiers). (A) ROC AUC scores are shown (y-axis) for each best-performing classifier, hyperparameter, and fingerprint combination, for each of the 17 properties, for the same 20% test set as used in the POEM comparisons, against the mean of the 5-fold cross-validation score (Pre-Validation ROC AUC, x-axis), as assessed during model training. A dotted line is shown as a reference along the diagonal. (B) Shown are the POEM ROC AUC scores for each of the 17 properties being predicted, for a 20% test set (y-axis), and for leave-one-out cross validation on the remaining 80% set (the same set as in A) (x-axis). A dotted line is shown as a reference along the diagonal.

Tables:

Table 1. A summary of the performance (ROC AUC score) of POEM and the best-performing standard classifier-fingerprint combination on 17 ADMET properties. POEM Test: 80%/20% testing split for a direct comparison to traditional approaches. POEM Full: 'leave-one-out' full cross-validation.

Inventory of Supplementary Materials For: Predicting drug properties with parameter-free machine learning: Pareto-Optimal Embedded Modeling (POEM)
Supplemental Methods:
Supplemental Pseudocode 1. Evaluating dominance relationships
Supplemental Pseudocode 2. Evaluating absolute distance rankings to target molecule
Supplemental Pseudocode 3. Evaluating the probability of each label
Figure S1. Flowchart Outlining the POEM Algorithm
Table S1. List of 10 fingerprints used by POEM, using the RDKit implementation

Supplemental Results:
Table S2. List of best-performing classifier, hyperparameter, and fingerprint
Table S3. Non-consensus POEM evaluated on each fingerprint and task separately
Table S4. ROC AUC scores and optimized GraphConv hyper-parameters
Figure S2. Comparison of Blood Brain Barrier (BBB) and Caco-2 Permeability (Caco2) ROC AUC score distributions
Figure S3. Nested cluster validation of POEM models for 17 ADMET properties
Figure S4. Comparison of Blood Brain Barrier (BBB) and Caco-2 Permeability (Caco2) ROC AUC score distributions for POEM and GraphConv models using nested cluster validation
Figure S5.

Figure S2. Comparison of Blood Brain Barrier (BBB) and Caco-2 Permeability (Caco2) ROC AUC score distributions. Shown are standard box plots of the distribution of ROC AUC scores from either a withheld 20% test-set (red) or an 80% 'training' set (blue) that represents the pre-validation scores for the industry-standard methods, or the leave-one-out cross-validation scores on the same reference molecule library when used by POEM. Model performance was calculated from 100 random splits, each split being used to generate both a POEM model and an industry-standard model, for BBB and Caco2 (using their respective best method, fingerprint, and parameters, as listed in Table S2).

Figure S3. Nested cluster validation of POEM models for 17 ADMET properties. Shown are standard box plots of the distribution of ROC AUC scores from ten repeats of a nested cluster validation (34). Points are overlaid, and show the ROC AUC scores directly, colored to indicate the ratio of molecules in the test set relative to the 'training' set. The x-axis shows the minimum Tanimoto distance (calculated on Morgan R4) between any test set molecule and any training set molecule. As this distance increases, more molecules are removed from the training set, and predictions are being made on molecules that are increasingly unlike any reference molecules in the training set.

Figure S4. Comparison of Blood Brain Barrier (BBB) and Caco-2 Permeability (Caco2) ROC AUC score distributions for POEM and GraphConv models using nested cluster validation. Shown are standard box plots of the distribution of ROC AUC scores from ten repeats of a nested cluster validation (34). Points are overlaid, and show the ROC AUC scores directly. Scores are compared for Naive GraphConv (NGC) (red dots), hyperparameter-optimized GraphConv (OGC) (blue dots), and POEM (green dots). The top panel label shows the minimum Tanimoto distance (calculated on Morgan R4) between any test set molecule and any training set molecule. As this distance increases, more molecules are removed from the training set, and predictions are being made on molecules that are increasingly unlike any reference molecules in the training set. The rightmost panel label indicates which property is being predicted, BBB or Caco2.

Figure S5. Preliminary Observations of Regression Performance of POEM compared to GraphConv. Shown are standard box plots of the distribution of root mean squared error on the test set from ten repeats of a cluster validation (34). The errors are examined for 4 properties (each chart) and are compared for Naive GraphConv (NGC) (orange boxes), hyperparameter-optimized GraphConv (OGC) (green boxes), and POEM (blue boxes). The x-axis shows the minimum permitted Tanimoto distance (calculated on Morgan R4) between any test set molecule and any training set molecule. As this distance increases, more molecules are removed from the training set, and predictions are being made on molecules that are increasingly unlike any reference molecules in the training set.
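The minimum-distance constraint on the x-axis of Figures S3 to S5 can be illustrated with a short Python sketch. It assumes that fingerprints are represented as sets of on-bit indices, that Tanimoto distance is one minus the Tanimoto (Jaccard) similarity, and that a training molecule is dropped whenever it lies closer than the threshold to any test molecule. The function names and toy fingerprints are hypothetical, and the clustering procedure of reference (34) is not reproduced here.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b| for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def filter_training(
    train: Sequence[FrozenSet[int]],
    test: Sequence[FrozenSet[int]],
    min_dist: float,
) -> List[int]:
    """Indices of training molecules at least `min_dist` away from every
    test molecule (assumed filtering rule, see lead-in)."""
    return [
        i
        for i, fp in enumerate(train)
        if all(tanimoto_distance(fp, t) >= min_dist for t in test)
    ]


# Toy on-bit sets standing in for Morgan fingerprints (hypothetical values).
test_fps = [frozenset({1, 2, 3, 4})]
train_fps = [
    frozenset({1, 2, 3, 4}),  # identical to the test molecule
    frozenset({1, 2, 3, 5}),  # close analogue
    frozenset({1, 2, 7, 8}),  # more distant
    frozenset({9, 10}),       # no shared bits
]

kept_loose = filter_training(train_fps, test_fps, 0.0)
kept_mid = filter_training(train_fps, test_fps, 0.5)
kept_strict = filter_training(train_fps, test_fps, 0.7)
```

Raising the threshold shrinks the retained training set, which is the effect the captions describe: predictions are made with progressively fewer, and less similar, reference molecules.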