A Diversified Machine Learning Strategy for Predicting and Understanding Molecular Melting Points

Abstract

The ability to predict multi-molecule processes, using only knowledge of single-molecule structure, stands as a grand challenge for molecular modeling. Methods capable of predicting melting points (MP) solely from chemical structure represent a canonical example, and are highly desirable in many crucial industrial applications. In this work, we explore a data-driven approach utilizing machine learning (ML) techniques to predict and understand the MP of molecules. Several experimental databases are aggregated from the literature to design a low-bias dataset that includes 3D structural and quantum-chemical properties. Using experimental and polymorph-induced uncertainties, we derive a tenable lower limit for MP prediction accuracy, and apply graph neural networks and Gaussian processes to predict MP competitively with these error bounds. To further understand how MP correlates with molecular structure, we employ several semi-supervised and unsupervised ML techniques. First, we use unsupervised clustering methods to identify classes of molecules, their common fragments, and expected errors for each data set. We then build molecular geometric spaces shaped by MP with a semi-supervised variational autoencoder and graph embedding spaces, and apply graph attribution methods to highlight atom-level contributions to MP within the datasets. Overall, this work serves as a case study of how to employ a diversified ML toolkit to predict and understand correlations between molecular structures and thermophysical properties of interest.


Introduction
The accurate determination of bulk thermophysical properties of molecules and polymers using only single molecule structure is a topic of critical academic and industrial interest.
Historically, machine learning (ML) methods including quantitative structure-activity relationship 1 (QSAR) and quantitative structure-property relationship 2 (QSPR) methods have dominated in silico predictive molecular modeling, with considerable success across a broad array of prediction tasks including molecular solubility, biological toxicity, and thermophysical properties. These classes of data-driven, quantitatively predictive models, when incorporated with high-throughput screening and design efforts, can aid in the identification, generation, and characterization of molecular species with applications in drug design, 3 organic electronics, 4 and solar fuels materials, 5 among many others.
A molecular property of interest to a variety of critical industrial applications is the prediction of a molecule's melting point (MP). Not only does a molecule's MP define the temperature at which a material transitions from solid to liquid, but it can be correlated with a number of industrially vital material properties. For example, solubilities of candidate drug-like molecules are often estimated using a general solubility equation (GSE) approach, where one of the two inputs is the MP of a molecule. 6,7 The recent emergence of interest in ionic liquids has made the correlation of MP with ionic liquid structure a critical endeavor, especially as it pertains to their stability. 2,8 MP can also be well-correlated with a liquid's viscosity. 9 Molecular and polymeric glass transition temperatures are bounded by the MP of a material, making a priori knowledge of the MP a useful insight regarding anticipated glass transition temperatures and mechanical properties. In any application where high-throughput screening is an avenue for material discovery, accurate MP prediction will determine the scope of practical candidate materials.
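The GSE mentioned above can be made concrete with a small sketch. Yalkowsky's form of the general solubility equation estimates aqueous solubility from the MP (in degrees C) and the octanol-water partition coefficient log P; the exact GSE variant used in any given study may differ, so this is illustrative only:

```python
def gse_log_solubility(mp_celsius: float, log_p: float) -> float:
    """Yalkowsky general solubility equation (GSE) estimate.

    log10(S / mol L^-1) = 0.5 - 0.01 * (MP - 25) - log10(P)
    The melting-point term is zeroed for compounds that are liquid
    at room temperature (MP <= 25 C).
    """
    return 0.5 - 0.01 * max(mp_celsius - 25.0, 0.0) - log_p

# A hypothetical solid with MP = 125 C and logP = 2.0
print(gse_log_solubility(125.0, 2.0))  # -> -2.5
```

The linear dependence on MP is why an accurate MP prediction directly translates into a better solubility estimate in this framework.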
Two computational strategies exist for MP prediction: physics-based models and ML approaches. From a physics-based approach, ab initio or classical molecular dynamics simulations can be utilized to predict the MPs of molecular solids. 10,11 However, many of these simulation techniques depend strongly upon the challenging feat of predicting the 3D crystal packing of molecular structures a priori, though methods have been devised to circumvent this limitation. 12 Moreover, the accuracy of classical force-fields required to predict MPs is often insufficient to handle the vast diversity of chemical structures and intermolecular interactions present in many materials. Thus, even if 3D crystal structures could be known a priori, errors derived from force-field approximations are difficult to quantify, and molecular dynamics simulations, force-fields, and workflows are not yet easily scalable for high-throughput screening approaches across chemically diverse structures. Alternatively, ML strategies have been widely employed for the prediction of MPs, with models routinely achieving prediction errors of 35-50 K. 2,[13][14][15][16][17][18][19][20][21][22][23] These models rely on the use of experimentally determined MPs, combined with supervised ML, to regress molecular structures to experimental MPs. Within the last five years, the quantity of experimentally available MP data has drastically increased thanks to the efforts of Tetko, with recent work accruing over 200,000 experimental MPs derived from the patent literature. 14,15
In this work, we apply data-driven and ML approaches to understand and predict molecular MP, working with the SMILES representations of molecules as input for the ML algorithms. The primary contributions of this work are:

• An integrated experimental and quantum-chemical dataset. Aggregating from several literature sources, we build a dataset of ∼47k molecules with experimental MP, including augmentation with 3D molecular structures and quantum-chemical properties. By studying the literature on crystal polymorphs and experimental measurement procedures, we quantify the expected experimental error in MP reporting, providing useful anticipated bounds for MP prediction errors.
• State-of-the-art MP prediction from single-molecule properties. We employ graph convolution neural networks (GCN) and Gaussian Process Regression (GPR) to obtain MP predictions that approach the underlying experimentally limiting uncertainties in the data sets. Additionally, we assess the role of 3D structural, conformational, and quantum-chemical descriptors and their abilities to improve MP predictions.
• Diversified chemical analysis. We deploy a variety of unsupervised and supervised tools to characterize the relationship between molecular structure and MP. We build geometric spaces shaped by MP with semi-supervised variational autoencoders and graph embedding spaces, and inspect individual atomic contributions to MP via graph attribution methods. We partition our molecular space into clusters and characterize fragments in each cluster to better understand the chemical composition and organization of molecules based on MP.
In what follows, we discuss the methods used in this study; more technical details can be found in the supporting information (SI). We then present results and discussion around each of the methods. Finally, with the aim of transparency, we make our dataset and code available on GitHub. The code repository to build models and make predictions, along with the datasets themselves, can be found at https://github.com/argonne-lcf/melting_points_ml.

Methodology
A schematic diagram of the methods employed to study MP is depicted in Figure 1, and each of the components is briefly detailed hereafter. More detailed descriptions (e.g. hyperparameters) can be found in the SI.

Datasets
Four publicly available data sets of experimental MPs were chosen for this study, as outlined by Tetko. 13 The statistics associated with each of the data sets, including the combined data (labeled "All"), are summarized in Figure 2 a). The Bradley data set is a highly curated, "gold" standard 24 for MP data sets, and has been double-validated to only contain data with multiple reported measurements within 5 C. The Bergström data set, 21 which is an order of magnitude smaller in sample size, was also generated via rigorous manual curation. Additionally, most of the compounds reported in the Bergström data set fall well within a subset of the MP range of the Bradley data set. For these reasons, the Bergström and Bradley data sets were merged for this study. The Enamine data set was created by Enamine Ltd, 25 a chemical supplier. The OCHEM data set was derived from a diverse pool of non-curated data from the Online Chemical Modeling Environment (OCHEM). 26 Authors interested in further analysis of these datasets beyond our own should consult the excellent work of Tetko. 13 In this work, we augment the original data sets of Tetko by including a variety of structural and quantum-chemical descriptors, as outlined in the SI.

MP Prediction
For MP prediction we use three supervised learning methods: Random forests (RF), GPR, and GCN. Each model encompasses a different type of modelling approach for QSAR applications. RF is an ensembling approach that aggregates several randomized decision trees and pools predictions from each to generate more robust property estimates. RF is used frequently in QSAR applications since it is robust to different modalities of data and is straightforward to apply. GPR combines features of Bayesian linear regression and kernel ridge regression to generate a distribution of functions that best fit the data under Gaussian assumptions. GPR relies on learned kernel functions that use relative distances between data points to make predictions. Predictions on a data point x are reported as the mean of a Gaussian distribution, and the standard deviation represents the uncertainty bounds for the prediction. Both RF and GPR rely on predefined features (e.g. fingerprints, quantum-chemical properties, etc.) to represent molecular structures, whereas GCNs 27 directly utilize a graph-structured representation of a molecule, with atoms as nodes and bonds as edges of a graph. A GCN learns a vectorized representation of a molecule which can be used with another model, such as a multilayer perceptron (MLP), and trained end-to-end. A GCN works by iteration: for each node, it aggregates neighboring local graph information and transforms it via an MLP to retrieve a new node representation. It then projects all nodes to a graph-level vector, which can be thought of as a task-optimized fingerprint. 28 All graph operations are designed to preserve graph symmetries.
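As a sketch of how GPR produces both a prediction and an uncertainty, the following minimal NumPy implementation computes the posterior mean and standard deviation of a zero-mean GP with an RBF kernel. A toy 1D feature stands in for molecular descriptors; real featurizations, kernel choices, and hyperparameter fitting are considerably more involved:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def gp_predict(X_train, y_train, X_test, length_scale=1.0, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train, length_scale)
    K_ss = rbf_kernel(X_test, X_test, length_scale)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Toy 1D "descriptors" and a smooth target standing in for MP
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()
mean, std = gp_predict(X, y, np.array([[2.5]]))
```

The returned standard deviation is what allows one to flag predictions on molecules dissimilar from the training set, a point revisited in the results.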

Geometric Spaces
In this work, we construct geometric spaces structured around MP, which allow us to better understand how molecules organize according to MP and which serve as a sanity check for molecules that do not follow the distribution of a dataset. Since these latent spaces are high-dimensional, we reduce their dimensionality for visualization purposes using linear principal component analysis (PCA). To construct these geometric spaces, we apply two distinct methods: semi-supervised variational autoencoders (SSVAE) and the penultimate layers of GCNs. SSVAEs are generative models that learn to encode data into a vector representation in a latent space, and then decode the data back to its original representation.
Both operations are modeled with neural networks and optimized concurrently.
Gómez-Bombarelli et al. 29 first demonstrated the use of VAEs with SMILES strings to generate new molecules with drug-like properties using the ZINC 30 dataset. One key result was the ability of the SSVAE to shape the organization of the latent-space representation of molecules based on the predicted properties of interest. VAEs require large amounts of data to generalize to new molecules; since our labeled dataset has ∼47k molecules, we rely on semi-supervised learning to leverage larger unlabeled datasets. In our network, an MP-predicting neural network maps the latent space to predicted MPs. Because we want our latent space to be informed by molecules that have been synthesized and exist on a shelf somewhere in the world, we used a set of 1M purchasable molecules from eMolecules to inform the chemical structure of the latent space. 31 We train the VAE with a mix of labeled and unlabeled data: for each batch, we mask the property-prediction loss for unlabeled molecules. To ensure that we are able to construct grammatically valid SMILES, we use SELFIES. 32 Our geometric space is then the latent space of our SSVAE. This space can also be used for the inverse design of materials due to its built-in decoding capability. 33

Another means of constructing geometric spaces is to examine the space of activations inside neural networks by analyzing the penultimate layer of a GCN. In the case of regression, the ultimate layer is a linear model, so if the entire model is accurate, the penultimate layer can be used to embed molecules, and these molecules should be organized along a gradient, since the GCN must fit a line across this space in order to predict MP.
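The batch-wise loss masking described above can be sketched as follows. This is a simplified, illustrative loss that omits the KL-divergence and decoder details of a real SSVAE; the function name and weighting are our own:

```python
import numpy as np

def ssvae_batch_loss(recon_err, mp_pred, mp_true, labeled_mask, weight=1.0):
    """Combined SSVAE-style loss for a mixed labeled/unlabeled batch.

    recon_err:    per-sample reconstruction loss (computed for all molecules)
    mp_pred/true: predicted and experimental melting points
    labeled_mask: 1 where an experimental MP exists, 0 otherwise; the
                  property term is masked out for unlabeled molecules
    """
    recon = recon_err.mean()
    sq_err = (mp_pred - mp_true) ** 2 * labeled_mask
    prop = sq_err.sum() / max(labeled_mask.sum(), 1)
    return recon + weight * prop
```

Masking in this way lets the large unlabeled eMolecules set shape the reconstruction term while only the ∼47k labeled molecules drive the MP-prediction term.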

Graph Attribution
Graph attribution refers to the task of attributing values to elements of a graph. If we care about building interpretable predictions in GCN, then we wish to create a graph attribution method that assigns positive or negative weights to graph elements in relation to their importance for prediction. 34 For this purpose, we utilize grad-CAM 35 with GCNs. These methodologies have been previously explored in the context of drug-like properties. 36 Grad-CAM uses gradient information flowing into the convolutional layer of a GCN to understand the importance of each neuron for a given task. By normalizing this information, we are able to build a heatmap delineating the contributions for each node in a molecular graph.
It should be noted that the heatmap for each molecule is a local explanation, that is, the relative weights between different molecular heatmaps are not directly comparable.
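A simplified sketch of this Grad-CAM-style node attribution is given below: channel weights come from the mean gradient per channel, weighted activations pass through a ReLU, and scores are normalized per molecule (consistent with the note above that heatmaps are only locally comparable). The function is illustrative, not the exact implementation used in the paper:

```python
import numpy as np

def node_heatmap(activations, gradients):
    """Grad-CAM-style node attributions for one molecular graph.

    activations: (n_nodes, n_channels) node features from a conv layer
    gradients:   (n_nodes, n_channels) dLoss/dActivation from backprop
    """
    weights = gradients.mean(axis=0)               # one weight per channel
    scores = np.maximum(activations @ weights, 0)  # ReLU keeps positive evidence
    top = scores.max()
    return scores / top if top > 0 else scores     # per-molecule normalization
```

Because each heatmap is renormalized to its own maximum, scores from different molecules should not be compared directly.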

Active Sampling and Clustering
An important approach in active learning is the exploitation of the cluster structure within the underlying dataset. 37,38 Our aim is to perform active sampling based on unsupervised learning that can exploit the cluster structure of the MP data sets, thereby reducing the bias arising from passive sampling (i.e. random sampling). Unsupervised clustering involves organizing unlabeled data into groups of common clusters based on their similarity. 39 We apply unsupervised clustering to all of the unlabeled chemical data from our MP data sets with the following two goals: 1) To gain a qualitative understanding of uncertainties in the data sets arising from curation qualities, experimental conditions, and experimental techniques and 2) To create realistic chemical similarity-based test/train splits for use in supervised ML, and to understand the impact of this active sampling on MP prediction.
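The second goal above, chemical similarity-based test/train splits, amounts to assigning whole clusters to either side of the split so that near-duplicate molecules never straddle it. A minimal stdlib sketch (our own helper, not code from the paper):

```python
import random

def cluster_split(cluster_labels, test_frac=0.3, seed=0):
    """Assign entire clusters to train or test so that chemically similar
    molecules never appear on both sides of the split."""
    clusters = sorted(set(cluster_labels))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    test_idx = [i for i, c in enumerate(cluster_labels) if c in test_clusters]
    train_idx = [i for i, c in enumerate(cluster_labels) if c not in test_clusters]
    return train_idx, test_idx
```

Compared with passive (random) splitting, this makes the reported test error a more honest estimate of performance on genuinely novel chemistry.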

Limits to Prediction
With the curation of large experimental MP data sets, it is necessary to estimate the inherent limitations of MP prediction, specifically as it relates to experimental uncertainties, sample purity, crystal polymorphs, and data parsing and recording. What magnitude of error is expected from the presence of crystal polymorphs? What is the intrinsic MP precision when an experiment is done "perfectly"? When uncertainty is incorporated into the experimentally recorded values of the MP, how does this influence our expectations of the maximum obtainable predictive accuracy? In what follows, we consider three primary contributions to MP error: the presence of crystal polymorphs, experimental errors/uncertainties, and errors in data recording.

Crystal Polymorphs
The presence of crystal polymorphs with distinct MPs for a given molecular structure can impact the predictive accuracy of the ML algorithm. As ML algorithms generally only predict a single value of the MP for a given molecular structure, the predicted MP value likely corresponds to the MP of a single molecular polymorph, the identity of which is usually unknown. Consequently, this polymorph ambiguity introduces an inherent uncertainty in the prediction task. Naively, one can grasp the potential magnitude of such differences in MP by considering a substance such as cocoa butter, whose industrially relevant polymorphs are known to have MPs separated by nearly 20 K. 40 To more quantitatively approach the issue, we have examined the MPs of 119 unique crystal polymorphs gathered from the experimental literature. 41,42 In this data set, we have computed the size of the MP interval in K for all polymorphs of each molecular structure (∆T m ), and histogrammed the distribution of MP intervals in Figure 2b). To compute this interval, we take the difference between the maximum and minimum recorded MPs for polymorphs of a given molecule. We observe that over half of the examined molecules would exhibit polymorph-related predictive errors of less than 10 K, which is an encouraging result for QSPR predictions.
Moreover, over 80 percent of molecules would have their errors bounded by 20 K, though it is somewhat troubling that for some polymorphs MP variations as large as 57 K have been measured. However, the fact that 96 percent of molecules exhibit a MP interval of less than 30 K, along with the fact that only a fraction of the molecular structures in a data set will exhibit multiple crystal polymorphs, suggests that crystal polymorph induced inaccuracies are likely not the sole feature limiting the performance of MP prediction. The first and second moments of the simple crystal polymorph MP distribution correspond to a range of ∼11-16 K, which we tentatively use as a lower bound for our expected error due to polymorphs. Note that the previously reported MP prediction accuracies of 35-50 K are significantly worse than the bound derived from the simple analysis of Figure 2b).
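The per-molecule interval described above is a simple max-minus-min over all polymorph measurements; a sketch with hypothetical data:

```python
def polymorph_intervals(records):
    """Melting-point interval (max - min, in K) per molecule from a list
    of (molecule_id, mp_kelvin) polymorph measurements."""
    mps = {}
    for mol, mp in records:
        mps.setdefault(mol, []).append(mp)
    return {mol: max(v) - min(v) for mol, v in mps.items()}

# Illustrative records, not the 119-polymorph data set from the paper
data = [("A", 350.0), ("A", 362.0), ("B", 410.0), ("B", 415.0), ("B", 402.0)]
print(polymorph_intervals(data))  # {'A': 12.0, 'B': 13.0}
```

Histogramming these intervals over the real polymorph data set yields the distribution in Figure 2b).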

Experimental and Chemical Uncertainties
MPs are typically measured via either a melting point apparatus (MPA) or differential scanning calorimetry (DSC) experiment. Generally speaking, if properly calibrated and performed, both DSC 43 and MPA (https://www.thinksrs.com/products/mpa100.html) measurements should exhibit reproducibility of ∼0.1 C. For the specific case of DSC, one observes a peak in the melting curve from which a specific MP must be derived. ICTAC standards state that one should take the onset of the melt peak as the MP for metals and organics, but that the peak value should be used for polymers. 44 However, even with these considerations, if properly performed, the majority of pure organic materials typically exhibit melting ranges of 1-2 C for a given material.
To properly perform either MPA or DSC experiments, the heating rate must be appropriately chosen. Typical heating rates for both MPA and DSC, depending on the precision required, are between 0.1 C/min and 20 C/min, with most high precision studies occurring at heating rates less than 1 C/min. Despite the majority of DSC peak widths for small organic molecules being 1-2 C, one can establish a generous upper bound for potential experimental error in MP by examining the literature of the heating-rate dependence of macromolecule MP, where heating-rate effects should be largest. For the case of crystalline polyethylene, the MP decreases approximately 6.5 C when the heating rate is increased from 0.6 C/min to 20 C/min. 45 With this information in mind, if these experiments are performed for pure samples, with appropriate heating rates, and the MP value is taken at the onset of the melt peak, DSC and MPA measurements should yield measurement errors less than 1-2 C, with a polymer-derived maximum bound of roughly 6 C. We emphasize that these arguments are back-of-the-envelope calculations, but believe them to be in agreement with common experimental experience.
Sample purity is an orthogonal issue that can contribute to the inaccuracy of MP measurements. Indeed, in many cases, MP measurements are meant to assess sample purity by identifying a depression and broadening of the MP relative to a known pure sample. If a material has degraded in storage or during the experiment, then such purity issues will induce errors in the experimental MP. Tetko 13 performed an analysis of molecular structures exhibiting high MP prediction error and concluded that functional groups capable of decomposition during storage/heating were significantly more represented in the set of outlier compounds relative to the rest of the data set. The magnitudes of these errors are entirely dependent upon the identities of the impurities, and so we refrain from generally speculating on their magnitudes.

Errors in Data Recording
Tetko 13 provided an in-depth analysis of outliers in their 45,000 molecule data set. Specifically, for the OCHEM and Enamine subsets, 394 and 427 outlier compounds were identified, respectively. These outliers corresponded to RMSE prediction errors > 130 C. Their analysis determined that 71 of the outlying compounds exhibited MP of less than room temperature, and consequently were likely not measured correctly. In the case of the OCHEM subset, three outlying compounds misreported MPs for the salt form of a compound, three cases reported the wrong temperature units, and two cases misrecorded a minus sign. Upon removal of these outliers and comparison to other literature values of MP for questionable data points, this screening improved their predicted RMSE significantly.
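Recording errors of the kind catalogued above (dropped minus signs, wrong units, salt forms) can often be caught with a coarse automated screen before modeling. The thresholds below are illustrative choices of ours, not values from the paper:

```python
def flag_suspect_mp(mp_kelvin, low=298.0, high=700.0):
    """Flag melting points that warrant manual review: sub-room-temperature
    values (possible liquids, dropped minus signs, or unit errors) and
    implausibly high values (possible C/F/K confusion)."""
    flags = []
    if mp_kelvin < low:
        flags.append("below room temperature: check sign/units or liquid state")
    if mp_kelvin > high:
        flags.append("unusually high: check unit conversion")
    return flags
```

Such a filter only nominates candidates; the actual correction still requires comparison against independent literature values, as done in the outlier analysis above.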

Passive Sampling Data Analysis
With anticipated experimental uncertainties established, we analyze the statistics of the MP data sets, as shown in Figure 2a). The Enamine data set exhibits a higher mean MP relative to other data sets, resulting in a positive skew as observed by a long tail of the histogram at higher temperatures. Enamine's mean MP is also closest to that for the drug-like region (i.e. 423 K). As noted by Tetko et al., 13 the Enamine data set was generated using identical experimental protocols for all analyzed molecular species. The MP distribution statistics reveal that Enamine also has the smallest standard deviation among the analyzed data sets. The OCHEM data set is an aggregation of a variety of diverse data sources obtained with different experimental protocols and measurements and exhibits a long tail at low temperatures (i.e. negative skew). The large standard deviation of the OCHEM data set relative to Enamine can likely be attributed to the heterogeneity of sources and measurement protocols as reported by Tetko. 13 The well-curated nature of the BradBerg data set implies that the large standard deviation observed in the distribution in Figure 2a) is due to the inherent diversity of chemical structures and MPs in the data set. Combining all data sets resulted in a MP distribution with characteristics shifted closer to OCHEM (i.e. negative skew, mean and standard deviation closer to OCHEM).
In supervised ML, an algorithm is trained on a data set and validated on a held-out test data set. The most common way of creating these training and test data sets is via passive sampling, where the original data is randomly split into groups without regard for the underlying statistical nature of the data set. The regression results for different supervised ML models trained on the passively sampled MP data sets of Figure 2a) are shown in Table 1. We utilize the RF ML method as a benchmark for initially comparing the mean absolute error (MAE) metrics among the individual and combined data sets. The predicted MAE validated on individual test sets follows the trend OCHEM > BradBerg > All > Enamine. We observed a clear correlation between the relative predicted MAE of OCHEM and Enamine and their associated standard deviations. Interestingly, the BradBerg data set exhibits a low MAE and a high correlation coefficient, which we attribute to the highly-curated and chemically-diverse nature of the dataset, the latter of which is confirmed by its large standard deviation. These results emphasize the critical importance of having well-curated data sets: when data sets are not well-curated, including more data will not lead to better model performance.
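The two evaluation metrics used throughout this discussion, MAE and the correlation coefficient, can be computed directly; a minimal sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between experimental and predicted MPs (in K)."""
    return np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)).mean()

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between experiment and prediction."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return np.corrcoef(y_true, y_pred)[0, 1]
```

MAE reports typical error in the units of the target (K here), while the correlation coefficient is scale-free, which is why the two can rank data sets differently.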
The use of more advanced supervised ML algorithms, including GCN and GPR, on the passively sampled data sets leads to significant improvements in predictive accuracy. Specifically, for both GCN and GPR, MAE below 30 K can be achieved for the entire data set, with MAE of 28.9 K and a correlation coefficient of 0.78 obtained when using the GPR method in conjunction with a feature set containing both 3D and quantum-chemically derived descriptors. If one restricts the performance of the GPR method to only molecules in the 'drug-like' interval as described by Tetko, 13 we can obtain a cross-validated MAE of 25.8 K in the drug-like region. It is interesting that the GCN method, which does not include any quantum-chemical or 3D structural information, can obtain MAE below 30 K solely from the details of the graph structure derived from the molecular SMILES strings, a result that is in agreement with recent GCN work. 46 This points to the promise of graph-based techniques that have been described previously, 27,47 especially since these methods do not require the additional cost of conformer searches or quantum-chemical analysis to generate ML features. However, we do observe an improvement in predictive performance relative to the GCN when utilizing the GPR methodology and including both 3D structural information and quantum-chemically derived properties (solvation energy plays a reliable role in reducing the predicted MAE, as described later on). Additionally, the GPR framework provides an assessment of prediction uncertainty, which is desirable for MP prediction, especially if one is unsure of the chemical similarity between a new molecule and the model's training data set.

(Figure caption: Only parent chemical classes with at least 5% of the total fraction are shown; parent chemical classes with fractions smaller than 5% were merged into 'Others'. Cluster indices are the same as defined in Table 2.)

Fine Grained Butina Clustering
To investigate the ability of active sampling to create a chemically-aware data set for use with supervised ML, we apply the Butina clustering method to create a new data set ('Butina 0.6'). The MPs corresponding to the resulting 33,408-compound data set (13,974 molecules removed) are visualized in Figure 2a). It is clear that the fine-grained clustering generates a data set whose distribution of MPs is qualitatively similar to that of the parent 'All' data set, and selection of the new data set by unsupervised clustering did not simply prune outlier MPs at the wings of the distribution. The Butina 0.6 data set is composed of 77% of BradBerg, 72% of Enamine, and 68% of OCHEM.
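The Taylor-Butina algorithm behind this data set is a sphere-exclusion procedure: the molecule with the most neighbors within a distance cutoff becomes a cluster centroid, its neighbors are assigned and removed, and the process repeats. A minimal pure-Python sketch on a precomputed distance matrix (in practice one would use an existing implementation such as RDKit's, on fingerprint Tanimoto distances):

```python
def butina_cluster(dist, cutoff):
    """Taylor-Butina sphere-exclusion clustering on a square distance matrix."""
    n = len(dist)
    neighbors = [{j for j in range(n) if dist[i][j] <= cutoff} for i in range(n)]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        # the molecule with the most unassigned neighbors seeds the next cluster
        centroid = max(unassigned, key=lambda i: len(neighbors[i] & unassigned))
        members = (neighbors[centroid] & unassigned) | {centroid}
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

# Toy distances: molecules 0-2 are mutually similar, molecule 3 is a singleton
dist = [[0.0, 0.2, 0.2, 0.9],
        [0.2, 0.0, 0.2, 0.9],
        [0.2, 0.2, 0.0, 0.9],
        [0.9, 0.9, 0.9, 0.0]]
clusters = butina_cluster(dist, 0.5)
```

The '0.6' in the data set name refers to the distance cutoff; singletons left over after clustering are the molecules removed from the parent data set.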
For training of the supervised ML algorithms, a 70/30 split was applied to the Butina 0.6 data set, with the regression results shown in Table 3. We begin by comparing the RF regressor results for the Butina 0.6 data set with respect to the passively sampled data sets (Table 1). The total regression error falls below that of both OCHEM and Enamine individually, despite the data set still containing nearly 70% of each, without clipping outliers at high or low temperatures. The improved performance of the supervised ML on the actively sampled data set relative to the passively sampled data sets is further supported by the performance of more advanced regression methods, as shown in Table 3. RF exhibits the largest increase in predictive accuracy, ∼5.2 K MAE with respect to OCHEM. Contrastingly, the GPR and GCN exhibit 1.7 K and 2.3 K improvements in predictive accuracy, respectively.
The differences in improvements are likely due to the complexity of the supervised learning methods and the differences in the featurizations used. This supports the notion that more complex ML and featurization methods (GPR and GCN) are more effective at extracting relevant details during the learning process, even from the passively sampled data, relative to the simpler RF method. This is further supported by the performance on the Butina data set compared to that of the 'All' data set: for GCN and GPR, predictions are essentially identical, whereas for RF a noticeable 2 K improvement is observed. Consequently, in cases with limited data and/or less-sophisticated regression methods, active sampling of chemical space should be a reliable strategy for modest improvements in data sets where chemical space is not designed to be uniformly sampled. For the majority of data sets, particularly those that are experimentally derived, this uniformity of chemical space is not a priori anticipated.

Inclusion of 3D Structures and Quantum-Chemical Properties
Following the use of the actively sampled data set in place of the passively sampled data set, we next examine the role of different descriptor sets. The E3FP fingerprints, which also include 3D structure, result in a significant improvement relative to the Morse 3D descriptors, with the performance of the 3D fingerprints still being comparable to those of the 2D fingerprints used in conjunction with the RF model. Interestingly, we observe that results are similar when using DFT-optimized geometries, force-field-optimized geometries, or the resultant geometry from a low-energy conformer search. This suggests that knowledge of the single-molecule conformation, as well as precise details of intramolecular geometric structure, are less critical to MP prediction than a rough description of the molecular geometry/connectivity. To this end, while knowledge of the exact crystal structure would likely be critical to predicting polymorph-specific MP, the precise single-molecule geometry does not appear to improve MP prediction significantly.
The best performance of all descriptor combinations, including graph-based models, is observed when using a diverse feature set that includes 3D descriptors, quantum-chemically derived data, and the RDKit feature set described in the SI. The lowest MAE values in GPR are observed when combining 2D and 3D descriptors with RDKit features and quantum-chemistry data; this full combination is referred to as COMBO in Figure 4B). It is worth noting that the 2D descriptors plus RDKit features provide the most important contributions to the improved predictions.
The peak performance observed using GPR and a diverse feature set that includes 3D structure, quantum-chemical descriptors, and RDKit descriptors should be weighed in conjunction with the computational cost of generating such featurizations. In Table 3, GPR results in a ∼1 K reduction compared to graph-based methods; however, the graph-based methods do not require any knowledge of 3D structure or the expense of quantum-chemically derived feature sets. Consequently, while the inclusion of these properties leads to the highest performing models, graph-based ML methods are likely the path forward to obtaining the highest-performing predictions with the least cost for feature set generation.

(Figure caption fragment: ... to the data distribution of the training set. If a molecule is outside the general cluster of data points, it could mean the model is observing patterns not contained in the training set.)

MP Structured Geometric Spaces
One feature to note in Figure 5 d) is that the density of molecules is enclosed within a circular area. This is due to the prior of the SSVAE: each latent dimension is assumed to be Gaussian distributed, so the data concentrate near the surface of a hypersphere, which projects onto a circle in 2D. One notable result of the graph embedding space is that the notion of chemical clusters is easy to visualize even though this information was not explicitly provided during training. The SSVAE latent space did not exhibit this feature, but would likely develop it if such information were provided during training.
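The concentration of a high-dimensional standard Gaussian near a hyperspherical shell, which underlies the circular shape of the projected density, is easy to verify numerically; the latent dimension of 128 below is chosen purely for illustration:

```python
import numpy as np

# Sample from a standard Gaussian prior in d latent dimensions.
d = 128
rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, d))

# The norms concentrate tightly around sqrt(d): a thin spherical shell,
# whose 2D projection appears as a filled circle with a sharp boundary.
norms = np.linalg.norm(z, axis=1)
print(np.sqrt(d), norms.mean(), norms.std())
```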
In Figure 5 b) we combine graph embedding with the unsupervised clustering derived from the previous section to understand how the clusters discovered by the unsupervised clustering compare to the organization of the latent space observed via graph embedding.
Encouragingly, the chemical distinctions encoded by the unsupervised clustering appear prominently in distinct regions of the PCA projection of the graph embedding in Figure 5 b), further demonstrating that the structuring of the chemical latent space by MP corresponds to the distinct chemical classes derived from the unsupervised clustering algorithms. The graph embedding therefore organizes the latent space not only in a way that correlates with MP, but also in a way that preserves chemical locality.
To showcase the usage of these spaces, we sample from the latent space and decode the sampled vectors into SMILES strings. This decoding process produces valid structures at a rate of 73%. Since the unsupervised component of the SSVAE is trained on purchasable molecules, we expect the newly sampled structures to resemble plausible molecules. Because this space is organized around MP, we can sample new structures conditioned on a predicted MP. Understanding how MP shifts with molecular changes will also aid in the design of new materials.
With these considerations in mind, it is useful to consider the future state of such predictions.

Supporting Information Available
Computational dataset generation
We augment the original data sets of Tetko by including a variety of structural and quantum-chemical descriptors. The generation of these quantities begins with a list of SMILES strings and MPs downloaded from the OCHEM website. 26 RDKit 51 was used to convert SMILES strings into 3D structures from random initial coordinates, after which hydrogens were added and UFF energy minimizations were performed. The minimized geometries were then used to seed B3LYP/6-31G** geometry optimizations in Gaussian. 52 The LANL2DZ pseudopotential and basis set were used for iodine-containing molecules in the data sets. From the energy-minimized DFT geometries, the total energy, HOMO/LUMO energies, SMD solvation energy, 53 dipole moment, quadrupole moment, wavefunction extent, and non-electrostatic energies were extracted. SMILES strings were also converted to Morgan fingerprints. The quantum-chemically derived data sets used are available online as .json files.
In addition to the random coordinate generation and UFF minimization that seeded the quantum-chemical calculations, we performed a 3D conformer search exploring rotatable bonds and testing cis and trans isomers. 54 For each compound we produced and minimized 1500 conformers using RDKit with the MMFF94 force field, and the resulting sets of local minima were clustered to obtain a set of diverse, low-energy conformers. These conformers were used to generate 3D Morgan fingerprints for use with the supervised ML methods. A standard suite of 2D descriptors available in RDKit was also computed for all molecules in the data sets, as detailed below.
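The geometry-generation steps above can be sketched with RDKit roughly as follows; parameter choices such as the random seed and the small conformer count are illustrative, not those used in this work:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def seed_geometry(smiles, seed=7):
    """SMILES -> 3D structure with explicit hydrogens, UFF-minimized,
    suitable for seeding a subsequent DFT geometry optimization."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, useRandomCoords=True, randomSeed=seed)
    AllChem.UFFOptimizeMolecule(mol)
    return mol

def conformer_search(smiles, n_confs=50, seed=7):
    """Generate and MMFF94-minimize a conformer ensemble; returns the
    molecule and per-conformer (status, energy) tuples."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=seed)
    results = AllChem.MMFFOptimizeMoleculeConfs(mol)
    return mol, results
```

The clustering of the minimized conformers into a diverse low-energy subset would follow as a separate post-processing step.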

Grad-CAM with GCN
To obtain importance weights for a task $y$, Grad-CAM computes the gradient of $y$ with respect to the activations of a GCN hidden layer, which we denote $A(\mathrm{node}_i) \in \mathbb{R}^K$, i.e. $\partial y / \partial A(\mathrm{node}_i)$. These back-propagated gradients are global-average-pooled across all nodes to obtain an importance weight $\alpha_k$ for each of the $K$ activation dimensions,
$$\alpha_k = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial y}{\partial A_k(\mathrm{node}_i)},$$
where $N$ is the number of nodes. Using these weights in a weighted summation across the activations, we arrive at the Grad-CAM importance of node $i$:
$$L(\mathrm{node}_i) = \sum_{k=1}^{K} \alpha_k \, A_k(\mathrm{node}_i).$$
To improve the interpretability of these importances, they can be $\ell_2$-normalized and passed through a ReLU so that only positive values are considered.
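A minimal NumPy sketch of this pooling-and-weighting step, assuming the hidden-layer activations and the back-propagated gradients have already been extracted from the GCN:

```python
import numpy as np

def grad_cam(activations, gradients, use_relu=True, l2_normalize=True):
    """Grad-CAM node importances for a GCN layer.

    activations: (n_nodes, K) hidden-layer activations A(node_i)
    gradients:   (n_nodes, K) gradients dy/dA(node_i)
    Returns an (n_nodes,) array of per-node (per-atom) importances.
    """
    alpha = gradients.mean(axis=0)       # global average pool -> (K,)
    heat = activations @ alpha           # weighted sum over channels
    if use_relu:
        heat = np.maximum(heat, 0.0)     # keep only positive evidence
    if l2_normalize:
        norm = np.linalg.norm(heat)
        heat = heat / norm if norm > 0 else heat
    return heat
```

Mapped back onto the molecular graph, these per-node values give the atom-level MP contributions visualized in the attribution figures.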

Active sampling details
The workflow deployed to achieve the outlined goals is shown in Fig. 1.

Semi-supervised Variational Autoencoder
We base our VAE architecture on the implementation found in the MOSES generative benchmark. 58 The encoder is a single-layer GRU with a hidden dimension of 256 and a dropout of 0.25, while the decoder is a three-layer GRU with 681 hidden dimensions and 0.25 dropout. Decoding is a harder process than encoding, and this is reflected in the complexity of each network.
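A minimal PyTorch sketch of this asymmetric encoder/decoder pair follows; the embedding, latent, and vocabulary sizes are illustrative assumptions, while the GRU depths, hidden sizes, and dropout match the description above:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Single-layer GRU encoder (hidden 256) -> Gaussian latent parameters."""
    def __init__(self, emb_dim=64, hidden=256, latent=128, dropout=0.25):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=1, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, x):                      # x: (batch, seq, emb_dim)
        _, h = self.gru(self.drop(x))
        h = h[-1]                              # final hidden state
        return self.to_mu(h), self.to_logvar(h)

class Decoder(nn.Module):
    """Three-layer GRU decoder (hidden 681, dropout 0.25) conditioned on z."""
    def __init__(self, emb_dim=64, hidden=681, latent=128, vocab=40):
        super().__init__()
        self.z_to_h = nn.Linear(latent, hidden)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=3,
                          dropout=0.25, batch_first=True)
        self.to_logits = nn.Linear(hidden, vocab)

    def forward(self, x, z):                   # teacher-forced tokens x, latent z
        h0 = self.z_to_h(z).unsqueeze(0).expand(3, -1, -1).contiguous()
        out, _ = self.gru(x, h0)
        return self.to_logits(out)             # (batch, seq, vocab)
```

The deeper, wider decoder reflects that reconstructing a valid SMILES string from a latent vector is harder than compressing one.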

Gaussian process regression
We performed Gaussian process regression (GPR) as implemented in GPmol, which is based on GPflow. 68 The covariance matrices in GPR were produced using the Jaccard (Tanimoto) index as a similarity measure between vectors produced from fingerprints and descriptors. We used 2D Morgan circular count fingerprints (ECFP-c) generated from SMILES strings with a bit size of 2048 and a radius of 4. 3D descriptors were created using the 3D Morgan fingerprint (E3FP). 69
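For binary fingerprint matrices, the Jaccard/Tanimoto similarity underlying the covariance matrix can be computed vectorized as in the sketch below (an illustration, not the GPmol implementation):

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Pairwise Jaccard/Tanimoto similarity between the rows of two
    binary fingerprint matrices A (n, d) and B (m, d)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    inter = A @ B.T                                        # |a AND b|
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter  # |a OR b|
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(union > 0, inter / union, 0.0)        # 0 for empty pairs
    return K
```

Because the Tanimoto index is a valid positive-semidefinite similarity on bit vectors, the resulting matrix can serve directly as a GPR covariance over fingerprints.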