Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning

Ring-opening enthalpy (ΔHROP) is a fundamental thermodynamic quantity controlling the polymerization and depolymerization of an important class of recyclable polymers, namely, those created from ring-opening polymerization (ROP). Highly accurate first-principles-based computational methods to compute ΔHROP are computationally too demanding to efficiently guide the design of depolymerizable polymers. In this work, we develop a generalizable machine-learning model that was trained on experimental measurements and reliably computed simulation results of ΔHROP (the latter provides a pathway to systematically increase the chemical diversity of the data). Predictions of ΔHROP using this machine-learning model require essentially no time while the prediction accuracy is about ∼8 kJ/mol, approaching the well-known chemical accuracy. We hope that this effort will contribute to the future development of new depolymerizable polymers.


INTRODUCTION
−8 As of 2021, only 5% of about 51 million tons of plastic created in the United States was successfully recycled, 7 leaving the remaining material for landfilling as the main method of "storing" polymer/plastic waste.The difficulty of polymer recycling is largely due to their inherent thermodynamic, thermal, chemical, and mechanical stability.−16 A main advantage of chemical recycling (compared to mechanical recycling) is that polymers produced from the recovered monomer feedstocks can preserve their purity and all of their original properties.0][11][12]15,17 This affinity for chemical recycling seen for polymers polymerized via ring-opening polymerization (ROP) is owed to the preferable thermodynamics these polymerizations tend to have. 15 Futhermore, the polymerizability/depolymerizability equilibrium of such poly-mers may be adjusted by controllable parameters, such as ringelemental chemistry, side group functionalization, and the monomer ring size.Therefore, research and development activities aiming at understanding, engineering, and designing (depolymerizable) polymers via ROP have been very active in the context of sustainability.10,12,15,17−19 Perhaps the most important readily tunable ROP quantity is the enthalpy of polymerization (ΔH ROP ), defined as the difference between the internal energies of the resulting polymers and the monomers used in the polymerization process.This thermodynamic quantity, which is closely related to the monomer ring size and the ring strain, can be measured 19,20 and computed 18,21−25 at reasonable levels of fidelity.Traditionally, ΔH ROP was computed by opening a ring monomer atomic configuration (believed to be its ground state), passivating the dangling bonds by suitable end groups, and then computing the energies using first-principles computations.21−25 This procedure is simple, but reaching acceptable accuracy is challenging. 18The main reason could be traced back to the soft-material nature of polymers, which are certainly not locked into any single atomic configuration, especially at and above room temperatures.Therefore, another method has recently been developed 18 that adequately samples the space of polymer and monomer atomic configurations at the level of first-principles computations for better estimation of ΔH ROP .While this advanced method is significantly more robust and accurate than the traditional method, it is also very computationally demanding.18 The main objective of this paper is to utilize machinelearning (ML) approaches 26−28 to build predictive models of ΔH ROP trained on data from experiments and the newly developed computational method, i.e., ΔH expt ROP and ΔH comp ROP .The reason for using two sources of data, experiments and computations, is the following: while experimental data constitute the ground truth, it is typically limited and tends to grow slowly.On the other hand, computational data, although full of built-in approximations owing to practicality, can be produced at scale, grown rapidly, and span new chemical spaces not seen in experimental investigation.As the training data of the model comes from two different sources, a multitask machine-learning approach 29,30 was utilized.The main motivation of a multitask learning algorithm/model is that by simultaneously learning multiple targets, ΔH expt ROP and ΔH comp ROP , the underlying correlations between them can be exploited and transferred to the model, 31 making it more robust and generalizable (to new chemical spaces) than an ML model trained on just the ground-truth experimental data set independently, otherwise known as a single-task model.Such ML approaches have helped design new-to-the-world polymers possessing attractive properties in the past.2,32,33 Toward this goal, we have generated and/or curated a comprehensive database of experimentally measured and computed ΔH ROP , namely, ΔH expt ROP and ΔH comp ROP , and developed an ML model to instantly predict ΔH expt ROP for new chemistries.This work focuses particularly on ROP chemistries as this class of polymers has repeatedly shown promise in producing polymers that can be recycled chemically.10,12,15,17−19 Figure 1 shows the overall pipeline enabled by the newly developed ML model.In the subsequent part of Section 2, we describe all of the critical components of the machine-learning approach to ΔH ROP , including data generation and capture, polymer fingerprinting, and learning architectures and evaluations.
Going forward, the ΔH ROP prediction model will (1) be extended to handle progressively more novel chemistries as newer data become available, (2)   34 Starting from millions of HTML/XML formatted articles, the procedure then occurred in four steps, including (1) document parsing, converting original documents to a format that is suitable for NLP, (2) coarse-grained filtering, where appropriate keywords were used to downselect several to thousands of articles from the initial set, (3) extracting useful information from the downselected papers, and (4) validating the extracted data by domain experts.
In this procedure, step 3 includes three substeps, i.e., (3a) target sentence identification, (3b) material name identification, and (3c) linking material to property.In (3a), heuristic rules were employed to identify candidate sentences.They included searching for sentences containing property names, e.g., enthalpy of polymerization, and units, e.g., kJ/mol or kcal/ mol.In (3b), two models were used to identify the compound names.The first model is ChemDataExtractor, 35 an opensource Python library that extracts chemical names using regular expression (i.e., regex patterns), and the second model is a BERT-based named entity extraction model 36 trained on a data set of sentences with manually labeled polymer names.Linking the identified material names to property values, which can be formulated as a relation extraction task, was performed in (3c).In this substep, the last material appearing before the property name is regarded as the owner of the property.These methods resulted in an NLP augmented literature search that greatly improved the speed of the data extraction and, as a result, the amount of ΔH expt ROP data.With the aid of these methods, the ΔH expt ROP data set was expanded from 88 manually collected data points to 109 data points, resulting in an approximate 24% increase of ΔH expt ROP data.

Computed ΔH comp
ROP Data Generation.In this work, ΔH comp ROP was generated using the multistep procedure developed in ref 18.First, a series of closed loops comprised of L monomer repeat units were constructed using Polymer Structure Predictor. 37,38These loops are representations of polymers.As L → ∞, the loop approaches the true polymer limit, and L = 1 represents the monomer.The computations were generally performed for L = {1, 3, 4, 5, 6}.A classical molecular dynamics (MD) simulation using an empirical Reax force field 39 was performed for each monomer/polymer model, thoroughly exploring the configuration space while preserving the atomic connectivity.Using classical MD, trajectories of over 1 ns were generated and thousands of snapshots were obtained and sampled to maximize the diversification of the sample set to then be used in ab initio MD simulation.The purpose of this step, using classical MD, is to provide a set of

The Journal of Physical Chemistry A
maximally diverse initial atomic structures on which to run ab initio MD.None of the data generated by classical MD are used to calculate ΔH comp ROP , and thus, no data resulting from classical MD are part of the ΔH comp ROP data used in subsequent multitask learning.For further information regarding the exact parameters used to run the classical MD simulation to generate initial structures for ab initio MD, see Supporting Information.
Next, a room-temperature ab initio MD simulation was performed for each sample, obtaining the lowest-energy equilibrated trajectory.The L-dependent estimation of ΔH comp ROP was then computed as , where E L and E 1 are the potential energies at equilibration of the ab initio MD trajectories of the polymer model (L > 1) and monomer model (L = 1), respectively, while ⟨•••⟩ stands for the average over the ensemble of the microstates.Finally, ΔH comp ROP was defined and computed as the ROP was computed by assuming that ΔH L ROP depends linearly on 1/L and then making suitable extrapolations to the limit of 1/L → 0. For the development of our target ML model in this work, ΔH L ROP data will be used directly as training data, i.e., the dependence of ΔH L ROP on L will be learned implicitly by the selected ML algorithms.Technical details of this plan can be found in Section 2.4.
The central idea of this computational scheme is that polymers are soft materials; thus, they are naturally not locked at any specific atomic configuration but rather switch across multiple microstates continuously and rapidly.Therefore, this scheme was designed to thoroughly explore the configuration space at two levels: The first is in a "coarse-grained" fashion, using a Reax force field with Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS). 40The second is using Density Functional Theory (DFT) with Vienna Ab initio Simulation Package (VASP). 41,42The energies relevant to ΔH comp ROP are computationally averaged over an ensemble of microstates at the DFT level.While it has been shown that these methods can lead to very accurate predictions for ΔH expt ROP via linear extrapolation, 18 it should be noted that the type of long-range polymer dynamics necessary to predict ΔH expt ROP with certainty cannot fully be accounted for with DFT alone.More details on the computational scheme can be found in ref 18 ROP .Given the nature of our first-principles computational scheme, the generation of ΔH comp ROP can be performed in a high-throughput, consistent, and targeted manner, i.e., ΔH comp ROP can be generated for certain polymers so that the training data can be diversified and the target ML model can become progressively more robust with respect to new chemistries.
2.4.Polymer Data Fingerprinting.The generated/ curated polymer data must be represented (fingerprinted) in machine-readable numerical form before they can be used to train the targeted ML model. 27,28,43Our data of ΔH ROP contain three classes of information, including the chemical structure of the polymers, usually given in terms of a SMILES string, 28,44 the nature of ΔH ROP , i.e., whether the data point is from experimental or computed sources (specified as (1, 0) or (0, 1), respectively), and the loop size specified as for all ΔH expt ROP data).Using the hierarchical fingerprinting procedure that was developed 27,28,43 during the past decade and currently used in Polymer Genome 27,28 the polymer SMILES is converted into a numerical vector of over 200 dimensions (or columns) to represent the chemical structure of the polymers.The three classes of information (chemical, data source, and L 1 ) were stacked into a composite fingerprint that was then mapped onto the target properties, i.e., ΔH expt ROP and ΔH comp ROP .Feature engineering, namely, permutation feature engineering, was subsequently used for each machine-learning algorithm tested in Section 3.1 to reduce the number of dimensions of the overall fingerprint to 80.This procedure is generic and can be used to prepare training data emerging from multiple sources.−47 With a scheme for creating the training data fingerprints for multitask ML, a suitable algorithm is needed to map the composite fingerprints onto the targeted property values.Four algorithms that are suitable for small training data sets, including Support Vector Machine (SVM), Random Forest (RF), Boosted Random Forest (BRF), and Gaussian Process Regression (GPR), were tested to determine the best learning technique for our data.The results for each learning algorithm are described in the following sections.

Machine-Learning Models and Validation.
The four algorithms considered were evaluated in a customized leave-one-out cross-validation (LOOCV) protocol in which a held-out polymer, for which ΔH expt ROP is available, is targeted and predicted by the ML models trained with four different training set schemes (also referred to as "cases").These cases were designed to systematically examine and reveal the role of ΔH comp ROP , the subsequent benefit of multitask learning, and the performance of the developed models.These four cases are The Journal of Physical Chemistry A summarized in Table 2.In the first case, only the available experimental data were used for training, so the model is "effectively" a single-task (ST) model, and so, this case is named ST.The next three cases are MT1, MT2, and MT3, which are designed to gradually supply the (multitask) learning algorithms with selected subsets of computational data, i.e., ΔH comp ROP , and, consequently, gradually improve ML models.Among three multitask (MT) cases, MT1 does not include computed data of any size (L) for the held-out polymer.This simulates the case when there is no computational data available for the polymer of interest being predicted.The MT2 case assumes that there is minimal computational data available, i.e., just corresponding to L = 3, in the training data for the held-out polymer.Finally, the MT3 case represents the situation where plenty of computational data are available for the held-out polymer being predicted.
Table 3 shows two error metrics, i.e., the root-mean-square error (RMSE) and the determination coefficient (R 2 ) obtained by using SVM, RF, BRF, and GPR for all 4 cases, namely ST, MT1, MT2, and MT3.The presented results were obtained by (1) selecting a held-out polymer for which ΔH expt ROP is available, (2) preparing the training data for the four cases as defined in Table 2, (3) using a learning algorithm to train an ML model for each training data set, (4) making predictions on the heldout polymer, and (5) screening over all the possible (68) heldout polymers to get the prediction metrics (i.e., RMSE and R 2 ).Hyper-parameters for a given ML algorithm were chosen using 5-fold cross-validation and a grid approach, where all permutations of a list of hyper-parameters were tested prior to the LOOCV scheme described above.In step (5), for the sake of a fair comparison, the held-out polymer was selected in the subset of 68 unique polymers for which both ΔH expt ROP and ΔH comp ROP are available.The obtained results, which are shown in Table 3, demonstrate that by combining computed data and experimental data, the trained (multitask) ML models are improved in accuracy.−51 Using GPR, RMSE is reduced from 12.2 kJ/mol for ST (trained only on experimental data) to 9.2 kJ/ mol for MT1, 8.8 kJ/mol for MT2 and 8.0 kJ/mol for MT3.This MT3 value comes close to the desired chemical accuracy, which is about 5 kJ/mol.Therefore, GPR 52 was selected for the eventual development of the predictive ML "production" model of ΔH expt ROP .Figure 3 visualizes the predictions performed for all the possible (68) held-out polymers in all four cases, given with respect to the ground truth, i.e., ΔH expt ROP for each of the four algorithms tested (RF, SVM, BRF, and GPR).
Some valuable notes can be drawn from the LOOCV analysis.First, the performance of ST models for all algorithms does not show satisfactory enough accuracy.We attribute this to be due to data scarcity, and it is the motivation for why such a large ΔH comp ROP data set was developed and multitask learning was employed.Next, in the case of GPR, adding in computed data that are not associated with the held-out polymer (MT1) improves the model's accuracy for predicting the unseen polymer.This is seen in the improvement of both RMSE and R 2 seen from ST to MT1.We believe this improvement from ST to MT1 is due to greater generalizability of the model as a result of greater chemical coverage represented in the computational data and thus evidence of the benefit of multitask learning.Second, significant improvement is seen from MT1 to MT2 where only the computationally least expensive ab initio MD computation is performed.This suggests that computed ΔH comp ROP , especially ΔH L = 3 ROP , can be  ROP computed for the held-out polymer is included MT3 experimental data + all computed data, including ΔH L = N ROP for all N computed for the held-out polymer The Journal of Physical Chemistry A done in a high-throughput manner, in order to develop a multitask model that can predict ΔH expt ROP for the cases of interest with satisfactory accuracy.Lastly, for all algorithms except for RF we see yet another improvement on going from MT2 to MT3, which shows that additional ΔH comp ROP of various sizes helps the ML models improve their ability to extrapolate to the experimental case.
3.2.Production Model.Given the analysis described in Section 3.1, we concluded that GPR is the algorithm of choice to develop a production multitask ML model that is trained on all ΔH comp ROP and ΔH expt ROP data.The main objective of this model is to predict the ΔH expt ROP from the chemical structure, or the SMILES, of the polymer that is obtained by opening a ring monomer.Because GPR returns not only the target value prediction but also an intrinsic measure of the prediction uncertainty, 52  ).

The Journal of Physical Chemistry A
the approximate accuracy reported when linearly extrapolating from multiple ΔH comp ROP of different sizes to the case of an infinite-sized model. 18or all cases of testing the model, it seems due to data scarcity that data performance is limited.This can be seen as a large difference between test and train RMSEs and in the fact that there is no leveling of the test curves for any case of ΔH comp ROP data availability in Figure 5.To generate Figure 5, test train splits that varied from 10% train and 90% test to 99% train and 1% test were randomly split among the ΔH expt ROP data.In each of these splits, the split was done randomly and 100 times in order to collect statistics for how different random splits could affect the accuracy of the trained ML model, allowing for the error bars to be plotted.In this random splitting of ΔH expt ROP , the ΔH comp ROP data subsequently added to the training was intentionally modified so the same cases outlined in Table 2 were tested in the learning curve as well.Figure 5 shows the importance of continued data expansion, and while experimental data are the highest fidelity data that can be used for model training and evaluation, the expansion of DFT data is much easier and faster to perform.Further, from the results of the LOOCV analysis specifically for the GPR algorithm shown in Figure 3 and Table 3, it seems evident that loop size 3 DFT data, i.e., ΔH L = 3 ROP , the cheapest data to gather from a time and computational resource standpoint, are significantly helpful in obtaining better predictions.Thus, in an effort to continue to improve the models, the ROP chemical space will continue to be searched first with ΔH L = 3 ROP computations in an effort to best create an ML model that can generalize to diverse chemistries.
Finally, the production model was developed using GPR and the choice of kernel was discovered to be optimal during LOOCV.Just prior to this, a 10-fold cross-validation was performed to achieve an average train RMSE of 1.55 kJ/mol and an average test RMSE of 8.80 kJ/mol.It is evident that overfitting is still present, but this is a common problem with a training data size of a few hundred as we have in this work.This work will continue and the production model will constantly be updated by training on new ΔH comp ROP data that is generated.

CONCLUSIONS
In this work, we have developed a largest-of-its-kind data set of ΔH ROP , which consists of data from both experimental measurements and high-throughput computations using a recently developed first-principles scheme. 18This data set was then leveraged to develop a multitask ML model that can predict the experimental value of ΔH ROP with an accuracy of 8 kJ/mol that approaches the (gold standard) chemical accuracy of about ≃5 kJ/mol.Given its high accuracy, this model is expected to contribute to the development of depolymerizable polymers via ROP.Polymers synthesized via ROP are focused on particularly in this work due to their shown potential in literature to create polymers that have the necessary polymerization thermodynamics to be depolymerized.Data from future experiments and computations will be used to further improve this model.
In addition, all experimental and computational data are also provided in excel format along with chemical structures and SMILES strings in two different formats ("long" and "wide").All codes and the data files may also be found in the Ramprasad github at https://github.com/Ramprasad-Group/enthalpy_ml_paper_code (ZIP) A table in PDF format detailing the experimental data collected from the literature is provided as part of the SI (PDF)  The Journal of Physical Chemistry A

Figure 1 .
Figure 1.Flowchart describing the overall computational workflow: an initial data set of both ΔH expt ROP and ΔH comp ROP is vectorized in such a way that the data source (ΔH expt ROP or ΔH comp ROP ) as well as the chemistry present are machine readable.Next, the multitask model to predict ΔH ROP is trained.Then, with this ML model and chemical intuition, the ROP chemical space can be further explored, and the most promising polymers can be suggested to perform additional ab initio computations generating new ΔH comp ROP data.Then these data can be fed back into training to improve the ML model.

Figure 2 .
Figure 2. Correlations between ΔH expt ROP and ΔH L = N ROP , shown for (a) L = 3, (b) L = 4, (c) L = 5, (d) L = 6, and (e) L = ∞.In the plots, r corresponds to the Pearson correlation between ΔH expt ROP and ΔH comp ROP and indicates how well-correlated the variables are for a given L.

Figure 5 .
Figure 5. Learning curve for the different cases as described in the Table2.Here, each case is indicated by a different color.The shape of the marker indicates between test and train performance, where the x indicates train and the dots indicate test.

2. METHODOLOGY 2.1. Experimental ΔH expt ROP Data Capture. Capturing
inform the next rounds of experiments and computations with attractive ΔH ROP and other property values, and ultimately, (3) aid in the accelerated design of depolymerizable and functional polymers.

. 2.3. Data Summary.
Table 1 provides a summary of our data set, which contains 193 unique ROP polymers and corresponding ΔH expt ROP and/or ΔH comp ROP .Among them, 109 ROP polymers have been studied experimentally with ΔH expt ROP values available, while for the remaining 84 polymers, only ΔH comp ROP data are available.Within the first subset (of 109 ROP polymers for which ΔH expt ROP data are available), ΔH comp ROP was computed for 68 polymers, leaving 41 polymers with ΔH expt ROP only.The "overlap" of 68 polymers that have both ΔH expt ROP and ΔH comp ROP is important for our work because, as revealed in Figure 2, experimental data and computed data are strongly correlated (with the correlation increasing with increasing L value).The main objective of multitask learning is to learn and incorporate such correlations implicitly in the ML model targets (ΔH expt ROP and ΔH comp ROP ), making the ML model more robust for cases for which ΔH expt ROP is not available.The subset of ΔH comp ROP contains 84 + 68 = 152 unique ROP polymers and 428 data points, which can be broken down to 199 data points for ΔH L = 3

Table 1 .
Summary of the ΔH ROP Data Generated, Accumulated, and Used Herein

Table 2 .
Summary of Four Cases Used in Evaluating the ML Algorithms, Which Are Different in the Training Data

Table 3 .
RMSE, Given in kJ/mol, and R 2 Obtained from SVM, RF, BRF, and GPR for Different Cases Described in the Text the selection of GPR for the production model has an extra advantage.Given a new polymer, a large prediction uncertainty clearly indicates that the chemistry of the polymer is not very well represented in the training data, and in this case, performing some computations for ΔH comp Parity plot for the preproduction model where 10% of the ΔH expt ROP data were withheld.Blue data points represent the test data, while gray data points represent the train data (which contain both ΔH comp ROP and ΔH expt ROP