Machine-Learning-Based Prediction of the Glass Transition Temperature of Organic Compounds Using Experimental Data

Knowledge of the glass transition temperature of molecular compounds that occur in atmospheric aerosol particles is important for estimating their viscosity, as it directly influences the kinetics of chemical reactions and the particle phase state. While there is a great diversity of organic compounds present in aerosol particles, experimental glass transition temperatures are known for only a minor fraction of them. Therefore, we have developed a machine learning model designed to predict the glass transition temperature of organic molecular compounds based on molecule-derived input variables. The extremely randomized trees (extra trees) procedure was chosen for this purpose. Two approaches using different sets of input variables were followed. The first one uses the number of selected functional groups present in the compound, while the second one generates descriptors from a SMILES (Simplified Molecular Input Line Entry System) string. Organic compounds containing carbon, hydrogen, oxygen, nitrogen, and halogen atoms are included. For improved results, both approaches can be combined with the melting temperature of the compound as an additional input variable. Both approaches achieve a similar mean absolute error of about 12−13 K, with the SMILES-based predictions performing slightly better. In general, the model shows good predictive power considering the diversity of the experimental input data. Furthermore, its performance exceeds that of previous parameterizations developed for this purpose as well as that of existing machine learning models. In order to provide user-friendly versions of the model for applications, we have developed a web site where interested scientists can run the model via a web-based interface without prior technical knowledge. We also provide Python code of the model.
Additionally, all experimental input data are provided in the form of the Bielefeld Molecular Organic Glasses (BIMOG) database. We believe that this model is a powerful tool for many applications in atmospheric aerosol science and materials science.


INTRODUCTION
The glass transition is a nonequilibrium phase transition. Kinetically, the glass transition temperature T g is defined as the temperature at which the viscosity of a substance reaches a value of about 10^12 Pa s. This transition from a liquid to an amorphous solid state is accompanied by a change in heat capacity that can be detected experimentally, e.g., by differential scanning calorimetry. 1 Due to its nonequilibrium nature, the glass transition temperature depends on the thermal history of the material, thereby impeding the comparison of experimental T g values and their theoretical prediction. 2 Over the years, several surrogate methods have been developed for predicting the conditions for a glassy state and its transition temperature for molecular organic compounds. 3−5 A long-known and simple, but surprisingly reliable, method is the Boyer−Beaman rule 6,7 that formulates a proportional relationship between the glass transition temperature T g and the melting temperature T m of a substance. Further studies have elaborated that the T g /T m ratio is ∼0.7 for a large variety of substances. 8 Nevertheless, this approach is sometimes considered to be only a rule of thumb, in light of the statistical deviation of the various data from the prediction (1σ: ±21 K, 2σ: ±42 K). 8 When it comes to theoretical frameworks that operate on historical data, so-called data-centric models, machine learning algorithms are the state-of-the-art tools. These experience-driven models have already been shown to yield promising results in the context of predicting the glass transition temperatures of inorganic oxide glasses; 9,10 however, only very few attempts have been made to apply them to molecular organic compounds. 11,12 The latter compounds are of great interest for atmospheric science, because they exist in the atmosphere, for example, as components of organic aerosol particles.
13 Moreover, the glass transition temperature is crucial for our understanding of organic aerosols, as it affects, for example, cloud formation processes or diffusion-dependent in-particle processes during gas uptake and chemical reactions. 14,15 Within the past decade, the application of machine learning methods in the natural sciences has experienced rapid growth. Most models that were developed in this context are used either for the discovery of new systems 16,17 or for the prediction of a specific property. 18−22 In 2014, Alzghoul et al. 11 published a study in which machine learning algorithms (mainly support vector machines and neural networks) were used to predict the glass transition temperature of organic molecular compounds. One limitation of their work is that only a rather small data set of 71 druglike substances (mostly functionalized heterocycles) was considered. Thus, their model is only applicable to this specific group of substances, which lacks the representation of more common and simple molecules. In addition, no possibility of accessing and using the model for further calculations was provided. Tao et al. 23 investigated the performance of over 70 different machine learning models on a polymer T g data set and observed that a random forest algorithm performed best. 23 Even more recently, Galeazzo and Shiraiwa 12 introduced a "tgBoost" model for T g predictions, which was built on a larger experimental data set of 298 substances (mainly obtained from Koop et al. 8 ). This model employs extreme gradient boost regression along with molecular descriptors based on the SMILES notation to predict the T g of organic molecules. The model code is accessible through a public repository, thereby allowing its implementation in future research activities.
We note that its application requires prior knowledge on the installation and usage of Python (and its packages) as well as other prerequisites, which may be a drawback to some researchers unfamiliar with these tools.
In the work described in this article, we present several variants of a machine learning model that achieve an improved performance by combining molecular descriptors and melting temperature. The model operates on an extremely randomized trees (extra trees) algorithm that is substantially different from the algorithms used in the previous models. Furthermore, we provide an online version of the model in the form of a web site, thus ensuring a user-friendly environment where everyone can use the model without any prior technical knowledge to predict T g of organic compounds. We also provide Python code for model applications through a public repository.

Machine Learning.
In general, machine learning (ML) aims at training an algorithm to solve a specific problem given the independent x-values (features) and the corresponding y-values (labels) of a predefined data set. 24 ML is used to describe complex multivariate dependencies beyond the capabilities of conventional fitting methods. If the output data are categorical, i.e., they can adopt only discrete values (for example, the categories of "cats" and "dogs" in image classification), a classification problem is faced. If the label is continuous, as in our case the glass transition temperature T g , then a regression problem needs to be solved. In both cases, the model should have learned about the data set's patterns and dependencies in a way that it is able to predict the yet unknown output label for a set of input features that was not included in the original training data set.
2.1.1. Decision Tree Based Algorithms. The algorithm that performed best in our study uses a decision tree 25 as its estimator. Resembling a real tree, a decision tree consists of multiple nodes connected by branches. The final nodes are called leaf nodes, while all the others are termed decision nodes. The very first node (root node) contains the whole data set. At each decision node, a feature-specific statement checks whether or not a sample exceeds a certain threshold (e.g., X1 ≥ 2). Such decisions split the data set and pass the splits (children) to the following nodes. In the case of a regression problem, the best split is chosen by maximizing the variance reduction between the parent and child nodes, i.e., by creating child nodes that have lower variances than their parent. Ideally, this procedure should result in low-variance leaf nodes, without overfitting the tree, i.e., strictly memorizing the training data and losing the ability to generalize. This tree-development procedure is controlled by various parameters such as the depth of the tree, i.e., the number of layers. Finally, the predicted value is obtained by forming the mean of all the values in one leaf.
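The procedure described above can be sketched in a few lines (a minimal illustration, not the authors' code; the toy features and label values are made up for demonstration):

```python
# Minimal sketch of a regression tree with a depth limit, assuming
# scikit-learn; the two toy features and the "Tg-like" label are invented.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                      # two toy features
y = 150.0 + 20.0 * X[:, 0] + rng.normal(0, 2, size=100)    # toy label

# max_depth limits the number of layers, guarding against overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)
pred = tree.predict(X[:5])   # each value is the mean of one leaf
```

Limiting `max_depth` is one of the parameters mentioned above for controlling tree development.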
The random forest algorithm 26 consists of multiple individual decision trees and is, therefore, an ensemble method. 27 Such an ensemble of trees is more powerful than a single decision tree because of the uncorrelatedness of the different trees. The trees become uncorrelated by assigning each tree a different subset of the data instead of providing each tree with the entire data set. The subset is randomly picked from the entire data set with replacement, which allows the subset to contain duplicate samples. This method is known as bootstrapping. Another way to uncorrelate trees is the random subspace method, where, instead of giving each tree access to all the features, each tree can only use a random subset of the features. Each tree then makes an individual prediction, and subsequent averaging over the predictions of all the trees yields the forest's final output.
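Bootstrapping itself is easy to demonstrate (a sketch with arbitrary sizes and seed, not taken from the model code):

```python
# Sketch of bootstrap sampling: drawing n indices with replacement,
# so that duplicates may occur and some samples are left out.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
bootstrap = rng.choice(n_samples, size=n_samples, replace=True)

# Because of replacement, the number of unique samples is usually < n.
unique = np.unique(bootstrap)
```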
The extra trees algorithm 28 relies on the same concept as the random forest, but has two major differences. First, the subsets are created without replacement and, second, instead of searching for the best split by maximizing the variance reduction, only a few random splits are considered and the best one among those is chosen. Generally, the random forest and extra trees algorithms are believed to perform equally well, but extra trees seems to outperform random forest when noisy features are present. 29 Another advantage of the algorithms described above is that they represent "explainable" ML algorithms. For some time, ML was criticized for being used mainly as a black box tool, as algorithms became too complex to be understood by users from other fields, as supported by a recent online survey. 24 However, the principles of decision trees are relatively easy to understand, and with some further study a more profound understanding can be gained. It is possible to inspect individual nodes and trees in order to follow the workflow, thereby enhancing reproducibility and confirmability.
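Both ensembles are available in scikit-learn, and the individual trees can be inspected as described above (a sketch on toy data; the data and ensemble sizes are illustrative, not those of the actual model):

```python
# Sketch comparing the two ensemble regressors; note that scikit-learn's
# ExtraTreesRegressor by default does not bootstrap and uses randomized
# split thresholds, matching the two differences described in the text.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))
y = 200.0 + 50.0 * X[:, 0] - 30.0 * X[:, 1] + rng.normal(0, 1, size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
et = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# "Explainability": individual trees of the ensemble are accessible.
first_tree = et.estimators_[0]
```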

Training and Test Sets.
In contrast to conventional fitting techniques, employing the whole data set to train a model is not recommended for ML applications, because more sophisticated algorithms can become so efficient at predicting the training data that no reasonable assessment of the model's performance is possible anymore. For this reason, a minor fraction of the data set should be set aside for testing the model appropriately; typically, 80−90% of the data are used for training. 9,10 In the current study, 90% of the data were used to train the model, whereas 10% were used to test it afterward.
The actual splitting of the data into a training and a test set was done randomly using a pseudorandom number generator (PRNG). 30 The PRNG is initiated with a seed, which is an arbitrarily chosen number. According to the seed, a set of random numbers is generated. A specific seed will always create the identical set of random numbers, which is important when it comes to comparing the results of different training runs. If a true random number generator were used at this point, then each run would produce splits with different samples in the training and test data, which would not allow for a comparison of the different runs. The PRNG makes it possible to control the randomness and obtain reproducible data set splits.
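A seeded 90/10 split of this kind can be sketched as follows (assuming scikit-learn's train_test_split; the seed value and toy data are arbitrary):

```python
# Sketch of a reproducible 90/10 train/test split with a fixed PRNG seed.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

# The same random_state (seed) always yields the identical split,
# which makes different training runs comparable.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.1, random_state=30)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.1, random_state=30)
```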
An important step in model development is to choose an appropriate ML algorithm. The above explanation underlines that the performance of any algorithm will depend on the randomly assigned training and test sets. However, when looking for a more general statement of how an algorithm will perform "in the wild" on unseen data, different splits of training and test sets have to be explored. This is typically achieved by the execution of the so-called k-fold cross-validation. 31,32 In this method, k is an integer number that represents the number of equally large segments into which the data set is split. Thereafter, a total of k runs will be performed. For every run, one of the k data segments is used as a test set, while the other segments are combined and used as a training set. The model performance is then assessed by a metric score, e.g., the mean absolute error. This procedure is repeated, but each time with a different data segment as the test set. Finally, after completing k runs, each data segment was used once as the test set and k − 1 times in the training set (see Figure S1 for illustration). The cross-validation score is obtained by calculating the mean of the k individual metric scores of each run. In this work, a 10-fold cross-validation (k = 10) was applied to investigate which of the various algorithms tested performs best. The metric we chose for our evaluation is the mean absolute error (MAE):

MAE = (1/N) Σ_{i=1}^{N} |T_g^{i,pred} − T_g^{i,exp}| (eq 1)

Here, T_g^{i,pred} is the predicted glass transition temperature of compound i, T_g^{i,exp} is the experimental T g value of compound i, and N is the number of samples in the test set.
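A 10-fold cross-validation scored by the MAE can be sketched as follows (assuming scikit-learn; the regressor settings and toy data are illustrative only):

```python
# Sketch of a 10-fold cross-validation with the mean absolute error as metric.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(120, 3))
y = 250.0 + 40.0 * X[:, 0] + rng.normal(0, 2, size=120)

# scikit-learn reports errors as negative scores, hence the sign flip below.
scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=30, random_state=0),
    X, y, cv=10, scoring="neg_mean_absolute_error",
)
mae_cv = -scores.mean()   # mean of the 10 per-fold MAE values
```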

Feature Representations.
The objective of this work was the robust prediction of the glass transition temperature of organic substances with a low MAE, implying that T g is the label. As features, numerical forms of molecular descriptors are required that characterize a particular substance distinctly. These kinds of vector representations are called fingerprints 33 or feature vectors. For our purposes, we developed several different approaches, which we term modes, and we examine each of these modes in detail below.
2.2.1. Functional Group Mode. The first approach builds on a series of chemical functional groups and uses the number of appearances of each of these functional groups in the molecule, as well as other molecular properties, to build the feature vector used by the algorithm. Working with functional groups is a principle that has already been applied successfully in numerous previous works, for example, in UNIFAC-type models for describing thermodynamic properties in multicomponent liquid mixtures. 34−36 Such functional group-based features can be divided into direct inputs, which users enter by themselves, and indirect inputs that are calculated autonomously by the model. The functional groups considered in our model are methyl (CH3), methylene (CH2), methine (CH), carbon atoms that are not bonded to hydrogen atoms (C), hydroxyl (OH), ether oxygen (−O−), carbonyl oxygen (=O), nitrogen atoms (N), and halogens (Hal). Additionally, the following features were included: the atomic oxygen-to-carbon (O/C) ratio of the molecule, its molar mass (M), and the double-bond equivalent (DBE), calculated according to the following equation:

DBE = 1 + (1/2) Σ_i n_i (v_i − 2) (eq 2)

where n_i is the number of atoms of element i present in the molecule and v_i is their valence, defined as an atom's number of regular bonding partners. 37 Finally, the melting temperature T m can also be included as an optional feature; see Table 1 for an overview of all the features used here. Altogether, these features form a feature vector that contains the molecular information and that can be passed into a ML algorithm. We chose these particular functional groups in order to describe a molecule as precisely as possible, while at the same time ensuring a relatively user-friendly input. While this feature vector representation still has some limitations (see section 2.3 below), it is much more distinct than simply using the chemical formula, for example.
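The indirect inputs DBE and O/C can be computed directly from the atom counts (a sketch; the function and variable names are ours, not those of the model code):

```python
# Sketch of computing two indirect Functional Group Mode features
# (DBE per eq 2 and the O/C ratio) from element counts.

def double_bond_equivalent(atom_counts, valences):
    """DBE = 1 + (1/2) * sum_i n_i * (v_i - 2)."""
    return 1.0 + 0.5 * sum(n * (valences[a] - 2) for a, n in atom_counts.items())

# Regular valences of the elements covered by the model.
VALENCES = {"C": 4, "H": 1, "O": 2, "N": 3, "F": 1, "Cl": 1, "Br": 1, "I": 1}

# Example: glycerol, C3H8O3 (fully saturated, so DBE = 0)
atoms = {"C": 3, "H": 8, "O": 3}
dbe = double_bond_equivalent(atoms, VALENCES)
o_to_c = atoms["O"] / atoms["C"]
```

As a cross-check, benzene (C6H6) gives DBE = 1 + (1/2)(6·2 + 6·(−1)) = 4, i.e., three double bonds plus one ring.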

SMILES Mode.
The second approach used in this study employs a SMILES string as an input. SMILES is the abbreviation for Simplified Molecular Input Line Entry System, which is a widely used method for capturing the chemical structure of a molecule as a series of characters, a so-called string. The SMILES string of a particular molecule is transformed into a fingerprint by a so-called featurizer. In the Python programming language, the RDKit package 38 is required to handle SMILES strings. Along with the DeepChem package, 39 the featurizer RDKitDescriptors is used here. It generates a 208-element vector, in which every element refers to a certain molecular property or piece of structural information. One restriction of this kind of representation is that stereochemistry cannot be resolved, even when formally declared in the SMILES string, because different stereoisomers result in the same RDKitDescriptors vector. Since the stereo information would not provide any benefit, it was neglected in the SMILES code; thus, the canonical SMILES string that does not consider stereo configuration was used. We note that the loss of stereo information is only a minor drawback at this point, because for the evaluated samples, the glass transition temperatures of configuration isomers and diastereomers are not expected to differ significantly, as evidenced by exemplary experiments. 40 Moreover, there are only very few literature sources in which any clear distinction between such stereoisomers has been made when providing experimental T g data of molecular compounds. Finally, also in the SMILES Mode we added the melting temperature as an optional feature, as it turned out that it significantly enhances the performance of the ML predictions in this mode; see further details below. A schematic drawing of the work flow in both of the employed modes is depicted in Figure 1.
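The featurization idea can be sketched with RDKit alone (a hedged illustration, not the model's exact DeepChem pipeline; it requires the rdkit package, and the example molecule is ours):

```python
# Sketch of SMILES-based featurization with RDKit: canonicalize the SMILES
# without stereo marks and evaluate RDKit's built-in descriptor functions,
# similar in spirit to DeepChem's RDKitDescriptors featurizer.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("OCC(O)CO")   # glycerol, as an example input

# isomericSmiles=False drops stereo annotations, yielding the canonical form.
canonical = Chem.MolToSmiles(mol, isomericSmiles=False)

# Descriptors.descList is a list of (name, function) pairs; evaluating all
# of them produces a numeric descriptor vector for the molecule.
fingerprint = [fn(mol) for _, fn in Descriptors.descList]
```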

Data Set.
For this work, experimental glass transition temperatures of organic molecular substances were required as a training set. The majority of the data used here was taken from a data set collected by Koop et al. 8 In that study, a total of 596 T g values of various substances were analyzed, of which about 480 originate from organic molecular compounds, and these were used in the current study. The other major data source is the supplement of Rothfuss and Petters, 41 who collected a very similar type of data set. Note that we carefully looked up each reference and excluded those data that were already obtained from our first source. 8 Finally, the data set was enriched with further experimental T g data obtained by differential scanning calorimetry measurements in our laboratory. 40,42,43 It is worth mentioning that we exclusively used experimental data and did not include any theoretically derived T g data, for example, from molecular dynamics simulations or similar methods. The only exception concerns 20 compounds for which experimental T g data were available but no experimental T m data could be found. In these cases, the melting temperatures were approximated with the help of the Joback method. 44 All experimental input data used in this study are provided in the form of the Bielefeld Molecular Organic Glasses (BIMOG) database in the Supporting Information and in a public data repository (DOI: 10.5281/zenodo.7319485).
The molecular compounds in our data set consist primarily of C, H, O, N, and halogen atoms. There are very few compounds with sulfur atoms, but they are so underrepresented in the data set that we do not recommend using the model for such compounds. The same is true for relatively large molecules (>600 g mol^−1): While there are some compounds in the data set with larger molar mass, there are too few to justify a generalization. One problem that arose at the early data assessment stage was that quite often multiple experimental T g data are available for the same substance, originating from different literature sources, methods, or experimental conditions, e.g., cooling and heating rates. In a previous study on the prediction of glass transition temperatures of polymers with a ML algorithm, this issue was already addressed. 45 That study revealed that in all cases analyzed, using the median of multiple-reported glass transition temperature values leads to the best results as measured by standard metrics such as the root mean squared error. For this reason, the same approach was followed in this work when encountering such "duplicate" samples. A detailed description of the data preprocessing can be found in the Supporting Information.
One drawback of the feature representations described above is that they are not always capable of describing a molecule unambiguously. For example, in a few cases either the constitution of the molecule or its configuration becomes ambiguous, implying that these originally distinct molecules are represented by the same feature vector; thus, they are treated as the same substance. As discussed above, the configuration is apparently not very important, but the importance of constitution cannot be assessed easily. There are cases where the difference in the glass transition temperature of constitutional isomers is only minor, as in the cases of 2-pentanol (T g = 140 K) and 3-pentanol (T g = 143 K). 46 On the other hand, there are isomers that differ significantly in their glass transition temperature, for example, sucrose (T g = 334 K) and trehalose (T g = 385 K). 8 As a result of these restrictions, the data set shrinks in terms of unique x-values, and at the same time there are multiple y-values for some x-values. Figure 2 visualizes this effect by showing the unified experimental data used in this study (gray bars) consisting of 355 entries. Applying the SMILES Mode feature representation reduces the original data to 330 unique entries (violet bars), because the RDKit descriptors do not resolve the stereo configuration, which is the information that gets compromised at this point. The Functional Group Mode feature representation is less powerful at providing a unique definition of a particular molecule, and, thus, only 286 unique entries remain for this representation (pink bars). Generally, a continuous uniform distribution of the label input data would be desired here to allow for a balanced representation over the entire data range; any strongly unbalanced distribution would lead to an over-representation of labels in the data range close to the distribution's maximum.
The distribution in Figure 2 has some gaps at the ends and in the middle, but overall it is relatively balanced. We note that the size of the data set is rather small compared to typical ML applications, but previous studies have already shown that small data sets can also yield promising outcomes. 47,48 Another interesting point concerning the duplicate feature values can be seen in Figure 3, in which the deviation of the individual duplicate glass transition temperature values T_g^i (all belonging to one feature vector) from their median value T_g^med is depicted. As noted above, such duplicated values originate because the same substance was actually measured multiple times or because the feature representation of different substances is the same. For the Functional Group Mode (Figure 3A) as well as the SMILES Mode (Figure 3B), the data follow a normal distribution centered around zero, most likely due to the fact that most of the values are doublets. Often, these doublets come from the same research groups, which measured a substance twice and/or reported the identical value in two different articles published a few years apart. The distributions shown in Figure 3 thus support the procedure described above of using the median T g value for such doublets.
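Collapsing such duplicates to their median can be sketched as follows (the keys and T g values are invented for illustration; any hashable feature representation would serve as the grouping key):

```python
# Sketch of unifying duplicate entries: group samples by their feature
# representation and replace multiple reported Tg values by their median.
from statistics import median

entries = [
    ("OCC(O)CO", 193.0),   # same key reported by two sources (a "doublet")
    ("OCC(O)CO", 195.0),
    ("CCO", 97.0),
]

grouped = {}
for key, tg in entries:
    grouped.setdefault(key, []).append(tg)

unified = {key: median(values) for key, values in grouped.items()}
```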

Parametrizations for the Prediction of T g .
To compare the predictive power of the ML models developed in this work with well-established literature methods, three alternative models were used for comparison: First, the Boyer−Beaman rule 6,7,49 (eq 3), which predicts T g based on the melting temperature T m of a substance:

T_g = g · T_m (eq 3)

Here, T m is the melting temperature and g is a constant, whose value was found to be approximately 0.7 (1σ: ±21 K, 2σ: ±42 K) according to a previous analysis. 8 Second, we used the parametrization by Shiraiwa et al. 3 (eq 4), which is based on the molar mass and atomic oxygen-to-carbon (O/C) ratio of organic compounds. Here, #C, #H, and #O are the number of carbon, hydrogen, and oxygen atoms present in the compound. The values of all the other parameters are given in Table 2. We note that the parametrizations presented in eqs 4 and 5 were designed for compounds consisting only of C, H, and O atoms, i.e., CH and CHO compounds.
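The Boyer−Beaman rule (eq 3) amounts to a one-line function (a sketch; the function name and example melting point are ours):

```python
# Sketch of the Boyer-Beaman rule, Tg = g * Tm with g ~ 0.7 (eq 3).

def boyer_beaman_tg(t_m_kelvin, g=0.7):
    """Predict the glass transition temperature from the melting temperature."""
    return g * t_m_kelvin

tg_pred = boyer_beaman_tg(300.0)   # ~210 K for a hypothetical Tm of 300 K
```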
More recently, Li et al. 5 provided a parametrization for calculating the glass transition temperature of CHON compounds, which also include nitrogen (N) atoms:

Feature Importance in the Functional Group Mode.
Initially, we applied different decision tree-based algorithms to a subset of the data that contained only CH and CHO compounds, in order to allow for a better and fair comparison to the literature parametrizations (eqs 3−5) that were only designed for this group of compounds. From now on, we will refer to this subset as the "CHO data set". Figure 4 illustrates how important the analyzed algorithms consider each feature of the Functional Group Mode for the CHO data set. These values were obtained by querying the algorithms' intrinsic attribute "feature importance" and were computed as the normalized total variance reduction. As already explained in section 2.1.1, the estimators in the random forest (green bars) and extra trees (purple bars) do not necessarily get full access to all the available features/samples, in contrast to a single decision tree (orange bars). The random forest estimators then search for the best split among all possible splits.
All three algorithms have in common that the top three features are the melting temperature T m , the molar mass M, and one of the oxygen-related features such as OH or O/C, some of which have already been identified as key parameters for the glass transition temperature previously. [3][4][5]8,41 This observation is consistent with the fact that they have a high correlation with T g and have been used already in previous parametrizations, i.e., eqs 3−5. While the observation that T m , M, OH, and O/C are the most important features is not surprising and not a new result in general, it nevertheless corroborates that also the ML model automatically identifies these strong correlations of physical significance. One may argue, however, that including indirect features such as T m , M, or O/C (see Table 1) is an unphysical self-fulfilling prophecy. Therefore, we also ran an extra trees model that was trained without the T m , M, and O/C features and analyzed its feature importances. Those values were then compared to the original model that was trained with all the features, but had the feature importance for T m , M, and O/C removed and subsequently renormalized the remaining relative importance feature values (see Figure S3). That comparison indicates that the removal of the features does not fundamentally change the model and only very slightly reduces its predictive power (MAE of 13.1 K is increased to 14.1 K when the indirect features were removed). Moreover, the relative feature importance distribution among the functional groups is hardly affected, with the number of OH-groups being the most important feature (see Figure S3 and related discussion). This outcome can be explained by the fact that OH-groups contribute to the formation of intermolecular hydrogen bonds, which are an essential property contributing to the viscosity and, hence, also the glass transition temperature of organic compounds, as noted previously. 
8,41 In conclusion, the feature importance comparisons of Figures 4 and S3 reveal that including the indirect features T m, M, and O/C in the Functional Group Mode does not compromise the underlying physical chemistry and even very slightly improves the model's predictive power.
In addition, we checked whether our model produces a physically meaningful behavior when predicting T g of various chemical compound classes such as n-alkanes, n-alcohols, diols, and triols as a function of the number of carbon atoms of the molecule (see Figure S4). We observed that T g generally increases slightly with the number of carbon atoms within a compound class, and T g strongly increases with the number of OH-groups at a constant number of carbon atoms, consistent with the results of prior studies. 12,41 We also considered the problem of multicollinearity caused by quasi-redundant features, which is a known problem in ML. 50,51 For that purpose, we calculated a correlation matrix for the feature−feature correlations. Features are considered highly correlated when their correlation coefficient is greater than 0.9; 50,51 however, that was not the case for any of the features (see Figure S2 and related text).
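The multicollinearity screening can be sketched as follows (a toy feature table; the number of features and the weakly related pair are invented):

```python
# Sketch of the feature-feature correlation check with a |r| > 0.9 threshold.
import numpy as np

rng = np.random.default_rng(5)
features = rng.normal(size=(100, 4))
# Make one pair of columns weakly related (well below the 0.9 threshold).
features[:, 3] = 0.2 * features[:, 0] + rng.normal(size=100)

corr = np.corrcoef(features, rowvar=False)        # 4 x 4 correlation matrix
off_diag = corr[~np.eye(4, dtype=bool)]           # ignore the diagonal (r = 1)
highly_correlated = bool(np.any(np.abs(off_diag) > 0.9))
```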
As mentioned above, the k-fold cross-validation is an effective method for evaluating different training and test sets, and the mean absolute error (MAE) is an appropriate metric for comparing different models or algorithms. The comparison of the three Functional Group Mode algorithms in a 10-fold cross-validation revealed that the extra trees regressor reached the best score, i.e., the lowest MAE compared to the random forest and decision tree algorithms. For this reason, we chose the extra trees regressor algorithm as the core of our ML models described in more detail below. Figure 5 shows the T g values predicted with the Functional Group Mode (FGM) versus the experimental T g data. The line of unity (solid black line) represents a perfect prediction, and the dashed lines indicate deviations of ±15 K from the perfect centerline. In Figure 5 we compare the results of our ML model operating on extra trees regression (magenta squares) with those of the previously introduced literature parametrizations (cyan diamonds, yellow circles, and blue triangles) on randomly picked test sets. We note that the data of the test sets were not included in the training of our ML models. Panel A shows the predictions of the Functional Group Mode on the CHO test set, and panel B shows those on the NHal-extended test set. At first glance, the magenta squares (ML model) are closer to the line of unity than any of the other types of symbols, indicating a good description of the experimental data by our model. In both panels, only a few outliers are observed that are not located within the corridor of ±15 K between the two dashed lines. Another point worth mentioning is that even though the melting temperature is the most important feature, there are a few samples where the predicted value from the Boyer−Beaman rule (cyan diamonds) is significantly off the center line, while the magenta square of the ML prediction is closer to the experimental value. This behavior implies that the ML models do not solely rely on the melting point dependency but also take other features into account and thereby adjust their weights for different chemical structures. In the discussion of Figure 4 above, we pointed out that T m is the most important feature in the ML models, which seems reasonable given its very high correlation with T g , as evidenced by the Boyer−Beaman rule (eq 3). However, a T m value may not be available for every substance for which the prediction of T g is desired. Therefore, we trained two ML model variants, one with and one without the melting temperature as a feature; see the more detailed discussion below.

Functional Group Mode.
Furthermore, to make a meaningful comparison of the models, a nested cross-validation was performed. This evaluation method gives a reliable estimate of how a model performs on different test sets. The nested cross-validation works like an ordinary cross-validation as described in section 2.1.2, but with the difference that a hyperparameter tuning is conducted for each CV training set. This process returns the best model parameters from a predefined selection of possible parameter values (also known as a grid-search CV). Thereafter, these best parameters are used to make the prediction on the corresponding CV test set. Once again, the MAE (eq 1) was used as the metric score. Since a nested cross-validation can only be performed for the ML algorithms, we calculated the MAE for the literature parametrizations (eqs 4 and 5) using the entire data set, which we believe is a fair and valid comparison. The resulting MAEs are given in Table 3. As highlighted in bold, the Functional Group Mode (FGM) reaches the lowest MAE (about 13 K) for both data sets. If the FGM is used without the melting-point feature, its performance drops significantly, but it is still more accurate than the previous parametrizations.
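A nested cross-validation of this kind (a grid search tuning hyperparameters inside every outer fold) can be sketched as follows; the parameter grid and the data are placeholders, not those used for the results in Table 3:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))                        # placeholder features
y = X.sum(axis=1) + rng.normal(scale=0.3, size=150)  # placeholder targets

# inner loop: grid search picks the best hyperparameters on each CV training set
inner = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [None, 10]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
# outer loop: the tuned estimator is scored on the held-out CV test set
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"nested-CV MAE: {-outer_scores.mean():.2f}")
```

Because the tuning runs inside each outer fold, the reported MAE is an estimate of generalization performance that is not biased by the hyperparameter search.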

SMILES Mode.
In this section, we repeat the evaluation of the ML model of the previous section, but this time for the SMILES Mode (SM) feature representation, which is built on a SMILES-based molecular descriptor. The unified version of the CHONHal data set within this feature space contains 330 unique entries, 44 more than that of the FGM model. In Figure 6A the T_g predictions of the SM are plotted against the experimental T_g literature values. In Figure 6B, the difference between the experimental and the predicted value, i.e., ΔT_g = T_g^exp − T_g^pred, is presented for the ML model predictions with and without the melting point as a feature.
Except for a few outliers, the points that correspond to the same molecule do not deviate much from each other, implying that the SMILES-based descriptor without the T_m feature still provides very reasonable predictions. We note a few data points where the no-T_m variant performed slightly better than the variant including the T_m feature. Small differences on the order of a few kelvins can probably be attributed to the variance of the algorithm: in our case, the decision trees are built in a greedy fashion, which implies that the algorithm follows the first split with the greatest immediate variance reduction. This procedure does not necessarily lead to the global minimum, which is why minor changes in building the trees can make a difference. Moreover, there are a few data points that show a higher deviation from each other (in both directions), independently of whether the Boyer–Beaman rule works well for these substances or not. Focusing on the two highest deviations in this graph (ortho-fluoroaniline and dibucaine, no-T_m mode), it is striking that the Boyer–Beaman rule works only moderately well, and this seemingly transferred to the T_m mode, which is significantly better.
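As a toy illustration of how molecule-level counts can be extracted from a SMILES string (the descriptor set actually used by the SMILES Mode is far richer and is not reproduced here), element symbols can be tallied with a small parser:

```python
import re
from collections import Counter

def atom_counts(smiles):
    """Count organic-subset element symbols in a SMILES string.
    Toy parser: ignores ring-closure digits, bond symbols, charges,
    and implicit hydrogens."""
    # match two-letter halogens first so 'Cl'/'Br' are not split up
    tokens = re.findall(r"Cl|Br|[CNOFISP]|c|n|o|s", smiles)
    # lowercase (aromatic) symbols count as the same element
    return Counter(t.capitalize() for t in tokens)

print(atom_counts("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 9 C and 4 O
```

Real descriptor generation would typically rely on a cheminformatics library rather than a hand-rolled regular expression; the sketch only shows the general idea of deriving numeric features from the string representation.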
In Figure 6A we compare the predictions of the SM model to the other parametrization methods. Once again, the T_g predictions of the SMILES Mode ML model with the T_m feature (purple squares) show the best overall agreement with the experimental T_g values. We again performed nested cross-validations, yielding MAE values of 11.7 and 15.1 K for the SMILES Mode variants with and without the T_m feature, respectively (see Table 4). The comparison shows that the T_m feature significantly improves the performance of the ML model, but even without it the SM model still outperforms the other parametrization-based methods. The Boyer–Beaman rule (cyan diamonds) produced a MAE of 19.7 K. The parametrizations by Shiraiwa 3 and DeRieux 4 (yellow circles and blue triangles, respectively) are only suited for CHO compounds, which is why the data points referring to NHal compounds calculated with these formulas are, for fairness, masked as gray data points. Green triangles represent T_g values predicted with the parametrization by Li et al. 5 (eq 6). Since this parametrization is only valid for CHON substances, not every nitrogen-containing substance in the test set could be considered (e.g., CHN and CHONHal compounds). Overall, the comparison in Figure 6A shows that the SMILES Mode ML model does a very good job of predicting T_g for a large variety of organic compounds.
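For reference, the Boyer–Beaman-type baseline (eq 3) against which the ML models are compared relates T_g linearly to T_m. A minimal sketch, assuming the commonly quoted ratio of about 0.7 (the coefficient actually fitted in eq 3 of the text may differ):

```python
def boyer_beaman_tg(tm_kelvin, g=0.7):
    """Boyer-Beaman-type estimate T_g ≈ g * T_m, with the ratio g
    commonly quoted as roughly 2/3 to 0.7 for organic compounds
    (the coefficient fitted in eq 3 of the text may differ)."""
    return g * tm_kelvin

print(boyer_beaman_tg(400.0))  # → 280.0 (with the assumed g = 0.7)
```

The one-parameter form makes clear why the rule can only capture the average T_m dependence, whereas the ML models additionally adjust for chemical structure.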
Finally, Galeazzo and Shiraiwa 12 very recently introduced the so-called tgBoost ML model for predicting the T_g of organic molecular compounds. They also performed a nested cross-validation of their model, which resulted in a MAE of 18.3 K. We did not include the tgBoost model in our comparison plots, because it was built on nearly the same data set as used for the models introduced here, and it is not clear to us which of these experimental data were used for training and which for testing. Therefore, we devised an alternative procedure to provide a direct and fair comparison of the two ML models: we used the entirely independent data set of Alzghoul et al. 11 to test their performance. Although this data set consists of rather specific organic druglike substances, it fulfills a valuable criterion for the comparison, since it is guaranteed that neither of the two models has seen the data before. Not every one of the 71 substances provided could be used in the comparison, because the tgBoost model is not meant to handle molecules with a molar mass larger than 500 g mol −1. Moreover, all sulfur-containing compounds were excluded, since neither of the two ML models was trained on such compounds. The remaining data set contained 41 compounds with experimental T_g values, which is still a decent size for a test set. Figure 7 shows the predicted T_g versus the measured T_g for the two ML models. The graph reveals that for the majority of the samples the SMILES Mode ML model developed here provides more accurate predictions than the tgBoost model, which is also evident from the MAE values of 12.9 K for our model and 22.8 K for the tgBoost model. In addition, Figure 7B shows that the spread of the predicted values is narrower for our SM model.
One striking observation is that both models underestimate the experimental T_g values of samples with higher T_g, whereas for T_g values below about 310 K a slight trend toward overestimation is observed. One possible explanation for this behavior may be that the druglike molecules in this data set are predominantly annulated polyheterocycles with additional functional substituents. A problem with such molecules may be that the models tend to overestimate the accumulated effect of these functional groups and, hence, predict a glass transition temperature that is too high. We surmise that this behavior may be overcome by including more of such highly functionalized molecules in the training data, thereby representing such species better. Such an extension of the current models is left to future studies.

Web Site.
The application of machine learning has become a frequently used tool for many types of problems in the recent past. 52,53 In addition to domain-specific expertise in the area of the particular problem, the use of ML requires a sophisticated technical understanding of statistics and informatics. Unfortunately, even when ML models are published, they often cannot be easily executed by external users without technical obstacles. Therefore, we have developed a user-friendly web application for our ML models, which can be accessed via any standard web browser at https://tgml.chemie.uni-bielefeld.de. In this way, we want to make our model and its results available to the public. For this purpose, we chose an architecture that supports the user by automatically selecting the best model, i.e., the one with the lowest MAE, for a particular set of input variables, making it comfortable and easy to use.
On accessing the web site, the user is presented with an overview page containing information about the ML model and instructions on how to operate it. Depending on the molecular information available, either the Functional Group Mode or the SMILES Mode can be selected; we recommend the SMILES Mode because of its lower MAE, as mentioned before. When the SMILES Mode is selected, the user is presented with a simple input form with two fields: one mandatory field for the SMILES string of the molecule of interest and another, optional field for its melting temperature. If both are available, we recommend entering both for best accuracy. After clicking the "Start Calculation" button, an input filter checks the user-given input for syntax errors in the SMILES string or unrealistic melting temperatures, and the application then automatically selects the respective ML model and determines the predicted glass transition temperature T_g. Thereafter, a result window opens, which displays all relevant parameters (type of model and mode, the SMILES string, chemical formula, melting temperature, and molar mass of the compound, and finally the model's T_g prediction and the associated mean absolute error). All output parameters can be copied to the clipboard with a single click for further processing. When the Functional Group Mode is selected, the user is presented with an input form in which the numbers of the different functional groups, the double-bond equivalent, and optionally the melting temperature can be entered. Again, an input filter performs a quick and direct plausibility check of the input parameters. After starting the calculation, a result page with all relevant input and output data is presented to the user. Further technical details on the architecture of the web site are given in the Supporting Information.
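The checks actually implemented in the web application are not published in detail, but a plausibility filter of the kind described might look as follows (a purely illustrative sketch; the bracket-balance rule and the temperature window are our assumptions):

```python
def validate_input(smiles, tm_kelvin=None):
    """Toy input filter: bracket balance in the SMILES string and a loose
    plausibility window (in kelvins) for the melting temperature."""
    errors = []
    if not smiles:
        errors.append("empty SMILES string")
    for pair in ("()", "[]"):
        if smiles.count(pair[0]) != smiles.count(pair[1]):
            errors.append(f"unbalanced '{pair}' in SMILES")
    if tm_kelvin is not None and not (100.0 <= tm_kelvin <= 700.0):
        errors.append("melting temperature outside plausible range")
    return errors

print(validate_input("CC(=O)O", 290.0))  # acetic acid: no errors → []
print(validate_input("CC(=O", 50.0))     # unbalanced parenthesis and implausible T_m
```

A full SMILES syntax check would require an actual parser; rejecting obviously malformed input early, as sketched here, is mainly a way to give the user immediate feedback before the model is invoked.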
The provision of our ML model for the prediction of glass transition temperatures as a web application has several advantages. First and foremost, technical barriers are greatly reduced for any user trying to make a T_g prediction for a particular compound. In addition, future updated versions of the model can easily be implemented on the web site, ensuring that users always work with the latest version (to ensure the reproducibility of past results, we are planning to offer older versions of the model on a separate subpage in the future, once applicable). We further note that we intend to improve the model over time by retraining the algorithms once sufficient new experimental data points are available. Therefore, we encourage everyone to submit their experimental glass transition temperature data via the contact form on the "Submit Data" subpage of the web site.

Python Code.
Furthermore, we also provide files for the execution of the model in a Python development environment. The bundle consists of the main Python script, a readme.txt, a requirements_console_script.txt specifying the required packages, and the trained model modes as pickle files (pickle is a Python package for saving and loading trained algorithms). In the Python script, explanations of the correct input format are given along with examples. The code also enables multicomponent input, which may be a fast and convenient option for some applications. More generally, this code can be used for further implementation into other programs or models and, therefore, could be interesting for the modeling community. The bundle is available at DOI: 10.5281/zenodo.7650576.
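Loading and applying a pickled model follows the standard pickle pattern. Below is a self-contained sketch with a stand-in model class; the bundle's actual file names and model objects are not reproduced here:

```python
import pickle

class MeanPredictor:
    """Stand-in for a trained regressor: always predicts the training mean."""
    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, n_samples):
        return [self.mean_] * n_samples

# serialize a "trained" model, analogous to how the bundle's pickle files were written ...
model = MeanPredictor().fit([250.0, 270.0, 290.0])
blob = pickle.dumps(model)  # the bundle instead ships files, loaded via pickle.load(open(path, "rb"))

# ... and restore it later for prediction
restored = pickle.loads(blob)
print(restored.predict(2))  # → [270.0, 270.0]
```

Note that unpickling requires the defining class (or the packages named in requirements_console_script.txt) to be importable, which is why the bundle ships the script and the requirements file alongside the pickle files.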

SUMMARY AND CONCLUSIONS
In this work, we presented a new ML model for the prediction of the glass transition temperature of organic molecular compounds. Two alternative modes were developed, in which the molecular information is either encoded in the type and number of functional groups or extracted from a SMILES string. Both modes show increased accuracy with the additional input of the melting temperature T_m as an optional feature. The model was trained with up to 330 different organic compounds, including those containing nitrogen and halogen atoms. The ML model operates on an extra trees regressor, which is an ensemble method based on decision trees. An evaluation of its performance showed that the model's predictions were significantly more accurate than those of conventional methods, i.e., fitting-based parametrizations, which was confirmed by direct comparisons on individual test sets as well as by a nested cross-validation procedure resulting in a MAE of 12–13 K (see Tables 3 and 4). For these reasons, we recommend using our new ML-based prediction model as an alternative to previously introduced conventional prediction methods for calculating T_g. Those parametrizations 3,4 have been used to predict T_g as an input parameter for estimating aerosol viscosity in other types of models. 54−57 We suppose that using our ML model for predicting T_g in those models could lead to improved results. As the ML model cannot be provided as an analytic equation, we developed a public online version, where users can enter the relevant variables of the desired molecule on a web site and receive the predicted T_g value as output, without having to run the full code in a Python environment on their own hardware. For in-model applications, we also provide Python code for use of the different model variants.
Our sensitivity tests revealed that the ML model not only accurately predicts the T_g of simple molecules but also shows high accuracy in predicting the T_g of highly complex structures. Altogether, the results suggest that the ML model can be expected to maintain robust performance across many different applications in various fields.

ASSOCIATED CONTENT
Data Availability Statement
The experimental data set used in this study is publicly available via the Zenodo repository at DOI: 10.5281/zenodo.7319485. A Python script for the execution of the model and T_g prediction is available at DOI: 10.5281/zenodo.7650576.
Information on data preprocessing, model analysis (feature correlation and relative feature importance), technical details of the web site, tables with metric scores of the model intercomparisons, a model info sheet, and a database of all experimental T_g values used in this study (PDF)