PolyID: Artificial Intelligence for Discovering Performance-Advantaged and Sustainable Polymers

A necessary transformation for a sustainable economy is the transition from fossil-derived plastics to polymers derived from biomass and waste resources. While renewable feedstocks can enhance material performance through unique chemical moieties, probing the vast material design space by experiment alone is not practically feasible. Here, we develop a machine-learning-based tool, PolyID, to reduce the design space of renewable feedstocks and enable efficient discovery of performance-advantaged, biobased polymers. PolyID is a multioutput graph neural network specifically designed to increase accuracy and to enable quantitative structure–property relationship (QSPR) analysis for polymers. It includes a novel domain-of-validity method that was developed and applied to demonstrate how gaps in training data can be filled to improve accuracy. The model was benchmarked with both a 20% held-out subset of the original training data and 22 experimentally synthesized polymers. Mean absolute errors for the glass transition temperature of 19.8 and 26.4 °C were achieved for the test and experimental data sets, respectively. Predictions were made on polymers composed of monomers from four databases that contain biologically accessible small molecules: MetaCyc, MINEs, KEGG, and BiGG. From 1.4 × 10^6 accessible biobased polymers, we identified five poly(ethylene terephthalate) (PET) analogues with predicted improvements to thermal and transport performance. Experimental validation for one of the PET analogues demonstrated a glass transition temperature between 85 and 112 °C, which is higher than that of PET and within the predicted range of the PolyID tool. In addition to accurate predictions, we show how the model’s predictions are explainable through analysis of individual bond importance for a biobased nylon. Overall, PolyID can help the biobased polymer practitioner navigate the vast number of renewable polymers to discover sustainable materials with enhanced performance.


Table of Contents
Supplementary Tables
Table S1 -Hyperparameter results.
Table S2 -Single vs. Multi-task predictions.
Table S3 -Hyperparameter loss and error data.
Table S4 -Database statistics and prediction performance.
Table S5 -Domain of validity metric.
Table S6 -Impact of data on accuracy.
Table S7 -Experimental data for validating model performance.
Table S8 -Theoretical yields for metabolites in metabolic model.
Table S9 -Accessible bio-based polymers.
Table S10 -Theoretical yields and predicted polymer properties.
Table S11 -Comparing Model Performance.
Table S12 -Database composition.
Table S13 -Database of polymers and properties from literature reports.

Supplementary Figures
Figure S1 -Latent space embedding.
Figure S2 -Hyperparameter optimization for polymer property prediction using message passing neural networks.
Figure S3 -Test set loss as a function of network depth and polymer size.
Figure S4 -NMR of selected synthesized polymers.
Figure S5 -Polymer structures of performance-advantaged PET replacements.
Figure S6 -Poly(ethylene 5-carboxyvanillate) analysis.
Figure S7 -Select diols and diacids from bio-based monomer database.
Figure S8 -In silico polymerization scheme.
Figure S9 -PolyID pipeline and graph neural network architecture.

Figure S1 -Latent space embedding.
The bond latent space embedding is shown by plotting the two principal components from the bond vectors. The plot shows that ester bonds spatially differentiate themselves in latent space based on the polymer type after multiple message-passing layers.

Figure S2 -Hyperparameter optimization for polymer property prediction using message passing neural networks.
Effect of polymer structure, graph network topology, and training parameters on test set loss for all 8 predicted polymer properties.
The approach to hyperparameter optimization here was twofold: (1) to provide insights into the relationship between network design (e.g., number of message passing layers, size of atom and bond feature vectors) and polymer structural representations (e.g., monomers in each polymer chain), and (2) to select a set of hyperparameters that perform reasonably well. While exhaustive grid search or advanced hyperparameter optimization techniques could potentially identify a better set of hyperparameters, the approach taken here balances scientific understanding and computational expense.
In Figures 2A-C and Figure S2, each datapoint represents an independent message passing neural network trained with different hyperparameters. To train each independent network, a standard 10-fold cross-validation was used, and a single hyperparameter value was changed while all other hyperparameters were held constant. Each of the 10 k-fold models was evaluated using a hold-out "test" set, and the average error was calculated across the 10 folds. Table S1 provides the range for each independently varied hyperparameter, the constant value used for each hyperparameter when it was not the one being varied, and the optimal value determined. Figure S2 shows the test set loss, which represents the aggregate error across all 8 predicted properties and 10 k-fold models. Figures 2A-C show the melt temperature mean absolute error to provide a more interpretable, exemplary version of the results.
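The one-at-a-time sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PolyID implementation: the hyperparameter names, the baseline values, and the `train_and_score` callable (which would train one fold model and return its error on the hold-out test set) are all hypothetical placeholders.

```python
from statistics import mean


def one_at_a_time_configs(baseline, ranges):
    """Yield (name, value, config), varying one hyperparameter at a time.

    All other hyperparameters keep their constant baseline value.
    """
    for name, values in ranges.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            yield name, value, config


def kfold_indices(n_samples, k=10):
    """Yield k (train, validation) index lists covering range(n_samples)."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def sweep(baseline, ranges, n_train, train_and_score, k=10):
    """Average the hold-out test error over the k fold models per setting.

    train_and_score(config, train_idx, val_idx) is a hypothetical callable
    that trains one model and returns its error on the hold-out test set.
    """
    results = {}
    for name, value, config in one_at_a_time_configs(baseline, ranges):
        scores = [train_and_score(config, tr, va)
                  for tr, va in kfold_indices(n_train, k)]
        results[(name, value)] = mean(scores)
    return results
```

Each key of the returned dictionary corresponds to one datapoint of the kind plotted in Figure S2: one varied hyperparameter value, with the error averaged across the 10 fold models.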

Figure S3 -Test set loss as a function of network depth and polymer size.
Test set loss (lower is better) as a function of the number of monomers in a polymer chain and the number of message passing layers in the network. Figure S3 shows that the benefit of increasing network depth as polymer size increases holds true for unseen data (i.e., the test set). The effect is also presented in Figure 2D for the validation set loss. To generate these data, the number of monomers per polymer chain and the depth of the network were varied while all other hyperparameters remained the same; 16 different models were trained and evaluated using an 80/20 train/test split. Each training set used 10-fold cross-validation during training.

To compare prediction accuracy as a function of the number of properties being predicted by a network, four models were trained using the same hyperparameters. Only the properties used in the training task were varied. In the first model, only glass transition was used. In the second model, glass transition and melt temperature were used. In the third model, glass transition, melt temperature, and density were used. In the fourth model, all 8 properties were used. From Table S2, no consistent performance improvement for multi-task learning is observed. Others have found more substantial improvements.1,2 The difference in findings may be due to network architecture, polymer structure representation, data set composition, or data set size. Additional benchmarking studies will be needed to determine the benefits of multi-task learning.

Table provided in additional attachment SI_Table-of-reachable-metabolites.csv
In Table S8, the functionality was determined through a structure-based search for the functionality in each column using RDKit. Diols, diacids, diamines, hydroxy acids, and amino acids were selected based on the molecular structure.
To determine whether a monomer would polymerize in a polyester, polyamide, or polycarbonate reaction, the following criteria were applied for each monomer type:
Diols: Diols for polymerization contained two hydroxyl groups and did not contain any acids or primary amines. For polycarbonates, the hydroxyl groups may be aromatic or aliphatic; for polyesters, only aliphatic hydroxyls were considered. In the case of polyesters, aromatic hydroxyls were not counted towards the number of hydroxyls in the structure and were not considered for the condensation reactions with acids, owing to the significant difference in reactivity between aliphatic and aromatic hydroxyl groups.
Diacids: Diacids for polymerization to make polyesters or polyamides when combined with diols or diamines, respectively, contain two acid groups. The diacids did not contain any aliphatic hydroxyls or primary amines.
Diamines: Diamines for polymerization to make polyamides when combined with diacids contain two primary amine groups. The diamines did not contain aliphatic hydroxyls or acid groups.
Hydroxy acids: Hydroxy acids for polymerization to make polyesters contained one aliphatic hydroxyl and one acid, and no primary amines. Aromatic hydroxyls were not counted towards the number of hydroxyls in the structure and were not considered for the condensation reactions with acids, owing to the significant difference in reactivity between aliphatic and aromatic hydroxyl groups.
Amino acids: Amino acids for polymerization to make polyamides contain one primary amine and one acid. The amino acids contained no aliphatic hydroxyls.

Figure S5 -Polymer structures of performance-advantaged PET replacements.
PABP replacements for PET which have been predicted to have a glass transition temperature above 100 °C and an O2 permeability equal to or lower than that of PET. ND indicates the polymer was not synthesized due to limited access to the monomer. The error bars indicate the standard deviation for predictions made by the 10 trained models produced from the 10-fold cross-validation. The only known report of using any of the three identified diacids in polyesters is by Hevus, who reported a Tg of 17 °C for poly(1,6-hexanediol 4-hydrophthalate).

Figure S6 -Poly(ethylene 5-carboxyvanillate) analysis.
CNMR, and GPC of dimethyl ester monomer and reaction products from polyester synthesis.
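The monomer selection criteria listed above (diols, diacids, diamines, hydroxy acids, and amino acids) can be expressed as a small rule set. The sketch below is one reading of those rules, not the authors' code: it assumes the functional-group counts have already been obtained (for example, from a structure-based search such as the RDKit search mentioned for Table S8), and the `GroupCounts` container and class labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupCounts:
    """Functional-group counts for one monomer (hypothetical container)."""
    aliphatic_oh: int = 0
    aromatic_oh: int = 0
    acid: int = 0
    primary_amine: int = 0


def monomer_classes(c: GroupCounts) -> set:
    """Return the monomer classes whose stated criteria the counts satisfy."""
    classes = set()
    no_acid_or_amine = c.acid == 0 and c.primary_amine == 0
    # Polyester diols: aromatic hydroxyls are not counted.
    if no_acid_or_amine and c.aliphatic_oh == 2:
        classes.add("diol (polyester)")
    # Polycarbonate diols: aromatic or aliphatic hydroxyls both count.
    if no_acid_or_amine and c.aliphatic_oh + c.aromatic_oh == 2:
        classes.add("diol (polycarbonate)")
    if c.acid == 2 and c.aliphatic_oh == 0 and c.primary_amine == 0:
        classes.add("diacid")
    if c.primary_amine == 2 and c.aliphatic_oh == 0 and c.acid == 0:
        classes.add("diamine")
    if c.aliphatic_oh == 1 and c.acid == 1 and c.primary_amine == 0:
        classes.add("hydroxy acid")
    if c.primary_amine == 1 and c.acid == 1 and c.aliphatic_oh == 0:
        classes.add("amino acid")
    return classes


# Illustrative counts entered by hand for well-known monomers.
ethylene_glycol = GroupCounts(aliphatic_oh=2)
terephthalic_acid = GroupCounts(acid=2)
hydroquinone = GroupCounts(aromatic_oh=2)
lactic_acid = GroupCounts(aliphatic_oh=1, acid=1)
```

Under this reading, a molecule carrying an aliphatic hydroxyl, an acid, and a primary amine together (such as serine) falls into no class, because each rule excludes at least one of the other groups.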

Figure S6A. TGA thermograms of the dimethyl monomer and the two reaction products from differing methods. The observed Td,50% values of the monomer, the standard synthesis product, and the extended synthesis product are 211 °C, 346 °C, and 405 °C, respectively. The increase of ~130 °C in thermal stability provides further evidence of successful polymerization of the polyester. Furthermore, the ~50 °C increase for the extended synthesis suggests higher degrees of conversion, with the possibility of crosslinking.

Figure S6B. DSC thermograms of the two reaction products under the different synthesis conditions described in the methods section. Polymers were annealed in the DSC Discovery 25 (TA Instruments) on the first cycle to 200 °C at 10 °C/min. Presented here is the second thermal cycle, showing that one attempt at a standard polyester synthesis yielded a Tg of 85 °C. An extended synthesis strategy yielded a higher Tg of 112 °C. From these thermal data, it appears that the extended synthesis conditions resulted in higher conversion, as indicated by a higher Tg, in accordance with the prediction from the Fox-Flory equation.
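For reference, the Fox-Flory relation invoked above is commonly written in the form below. This is the standard textbook expression, not a fit reported in this work: the limiting glass transition temperature T_{g,∞} and the constant K are empirical, polymer-specific parameters that are not given here.

```latex
T_g(M_n) = T_{g,\infty} - \frac{K}{M_n}
```

Because K is positive, Tg rises toward T_{g,∞} as the number-average molar mass M_n increases, which is the trend used above to interpret the higher Tg of the extended synthesis product.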

Figure S6C. FTIR transmittance of the dimethyl 5-carboxyvanillate monomer, the reaction product under standard polyester synthesis, and the reaction product from an extended synthesis method. The observation of two carbonyl peaks (1 and 2) is indicative of the two carbonyl environments present on the monomer/polymer. As the reaction time increases, the carbonyl signal shifts toward peak 2, which can be attributed to an averaging of the chemical environments in the higher-conversion, and thus higher-molecular-weight, polymer from the extended synthesis. The alkane peak (3), which is attributed to the methoxy group on the 5-CVA, is conserved regardless of synthesis strategy. The hydroxy peak (4) wanes from the monomer to the standard synthesis as hydrogen bonding is reduced. This peak is practically absent in the extended synthesis, which can be attributed to higher molecular weights, intra-chain hydrogen bonding becoming more pronounced, or side reactions with the hydroxy group on the 5-CVA during synthesis.

Figure S6E. 13C NMR data collected on a Bruker 400 MHz Spectrometer to provide structural identification of the monomer (Top) and the (standard) reaction product (Bottom). The structural identity of the monomer is confirmed in the above spectra. The identification of the 5-carboxy linkage carbons was confirmed as well. Identification of the ethylene glycol linkages proves challenging, as an unknown peak appears downfield of the methoxy (c) and alkane (p) carbons. Similar to the 1H NMR assignments, this unidentified peak (x) may be attributed to bis(hydroxyethyl) end groups or cyclic oligomers. NMR of the extended synthesis product was not performed due to insolubility.

Figure S6F. GPC data of the standard synthesis reaction product collected on an Agilent 1260 Infinity II LC system with a MiniDawn TREOS Multi-Angle Light Scattering (MALS) detector (Wyatt) and an Optilab T-rEX differential Refractive Index (dRI) detector (Wyatt). The polymer product is near the detection limit of the columns and is subject to high uncertainty. Qualitatively, most of the reaction product is low molecular weight polymer with small concentrations of higher molecular weight material, as determined by the large intensity of the MALS detector at 22 minutes. Assuming a dn/dc value of 0.2570 (PET), Mn, Mw, and dispersity were determined to be 608 Da, 1,220 Da, and 2.01, respectively. GPC of the extended synthesis product was not performed due to insolubility.
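The three GPC quantities reported above are related through the standard definitions of the number-average molar mass, the weight-average molar mass, and their ratio (the dispersity). The sketch below illustrates those definitions on a made-up discrete distribution; the function name and the example chain populations are hypothetical and are not GPC data from this work.

```python
def molar_mass_averages(chains):
    """Return (Mn, Mw, dispersity) for (molar_mass_Da, chain_count) pairs.

    Mn = sum(N_i * M_i) / sum(N_i)
    Mw = sum(N_i * M_i**2) / sum(N_i * M_i)
    dispersity = Mw / Mn
    """
    chains = list(chains)
    n_total = sum(n for _, n in chains)
    first_moment = sum(n * m for m, n in chains)
    second_moment = sum(n * m * m for m, n in chains)
    mn = first_moment / n_total
    mw = second_moment / first_moment
    return mn, mw, mw / mn


# Made-up two-population example: equal numbers of 500 Da and 1500 Da chains.
mn, mw, dispersity = molar_mass_averages([(500, 1), (1500, 1)])

# Consistency check on the values quoted in the caption: Mw / Mn.
reported_dispersity = 1220 / 608
```

The ratio of the quoted Mw and Mn (1,220 Da / 608 Da) rounds to the reported dispersity of 2.01.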
transition temperature for c,c-muconic acid-based polymers, which used two different models for predicting the values. The first model was parameterized using a training set with no c,c-muconic acid-based polymers, and the second contained a single instance of a muconic acid-based polymer, poly(1,4-butanediol-co-c,c-muconic acid).

Figure S6D. 1H NMR data collected on a Bruker 400 MHz Spectrometer. (Top) Dimethyl 5-carboxyvanillate monomer. (Bottom) The reaction product under a standard polyester synthesis. The dimethyl 5-CVA monomer is pure, with expected and labeled integrations. The standard synthesis spectrum indicates the presence of an ethylene glycol unit between the 5-CVA units, as observed by the presence of peak j. As the 5-CVA monomer is non-symmetrical, there is the possibility of different shifts, labeled j* and j**, due to head-to-tail, head-to-head, and tail-to-tail configurations or the possible presence of bis(hydroxyethyl) end groups. Additionally, the propensity of 1,3-dicarboxylic acid-substituted benzenes (e.g., isophthalic acid)4 to undergo cyclization could lead to the peak of variable integration, x. Importantly, the polymer from the extended synthesis conditions possessed poor solubility in NMR solvents, indicative of either side reactions or significantly larger molecular weight. NMR of the extended synthesis product was not performed due to insolubility.

Figure S10 -Training loss.
Exemplary training loss and validation loss curves for training the message passing neural network. Grey lines indicate individual models from the 10-fold cross-validation, and red lines indicate the average value across all 10 models.

Table S1 -Hyperparameter results.
Ranges used and optimal values determined in the hyperparameter optimization scheme of the message passing neural network for making predictions on polymer structures. The constant values are the values used for the other parameters while a single parameter was being varied within its range of values.

Table S2 -Single vs. Multi-task predictions.
Ranges used and optimal values determined in the hyperparameter optimization scheme of the message passing neural network for making predictions on polymer structures.

Table S3 -Hyperparameter loss and error data.
The mean absolute error as shown in Figure 2C and the test set loss data for Figure S2.

Table S4A -Database statistics and prediction performance for the full database.
Training database statistics, mean absolute errors for the 10-fold validation set, and mean absolute errors for the 20% hold-out test set. The database used for training and evaluating the model that produced the data in this table included data in SI_Table-of-polymer-properties.csv and external databases that could not be reproduced due to copyright. Details of the full database are provided in the Methods section of the main text.

Table S4B -Database statistics and prediction performance for published database.
Training database statistics, mean absolute errors for the 10-fold validation set, and mean absolute errors for the 20% hold-out test set. The database used for training and evaluating the model that produced the data in this table only included data in SI_Table-of-polymer-properties.csv.

Table S5 -Domain of validity metric.
Table containing the mean absolute error for glass transition temperature values as a function of the number of substructures in the predicted structures that are not found in the training set. Substructures were generated using RDKit's Morgan fingerprints method with a radius equal to two. As the number of substructures outside the training set decreases, the mean absolute error improves.
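The counting behind this metric can be sketched as follows. This is a minimal illustration, not the PolyID code: it assumes each structure has already been reduced to a set of integer substructure identifiers (such as the nonzero elements of an RDKit Morgan fingerprint of radius two), and the function names and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def training_substructures(train_fingerprints):
    """Union of substructure identifiers seen anywhere in the training set."""
    known = set()
    for fingerprint in train_fingerprints:
        known.update(fingerprint)
    return known


def n_outside_domain(query_fingerprint, known):
    """Number of query substructures absent from the training set."""
    return len(set(query_fingerprint) - known)


def mae_by_outside_count(records, known):
    """Group absolute errors by the number of out-of-domain substructures.

    records: iterable of (fingerprint_ids, measured_value, predicted_value).
    Returns {n_outside: mean absolute error}, ordered by n_outside.
    """
    errors = defaultdict(list)
    for fingerprint, measured, predicted in records:
        count = n_outside_domain(fingerprint, known)
        errors[count].append(abs(measured - predicted))
    return {count: mean(errs) for count, errs in sorted(errors.items())}
```

A prediction whose structure has a count of zero lies entirely within the substructures covered by the training data; larger counts flag predictions that rely on chemistry the model has not seen.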

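Table S6 -Impact of data on accuracy.
Table containing the mean absolute error for the glass transition temperature for c,c-muconic acid-based polymers, which used two different models for predicting the values. The first model was parameterized using a training set with no c,c-muconic acid-based polymers, and the second contained a single instance of a muconic acid-based polymer, poly(1,4-butanediol-co-c,c-muconic acid).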

Table S7 -Experimental data for validating model performance.
Properties of experimentally synthesized bioaccessible polymers and associated predictions for Tg and Tm.

Table S8 -Theoretical yields for metabolites in metabolic model.
A table containing values for the raw yield, percent yield, chemical functionality, and monomer class for polymerization for analytes in the four metabolic models.

Table S9 -Accessible bio-based polymers.
The number of unique polymers that can be generated from each of the following databases: MetaCyc, MINEs, KEGG, and BiGG. Totals for each polymer class account for overlaps between databases and are therefore not the sum of the column.

Table S12 -Database composition.
Training set database count broken down by polymer class.

Table S13 -Database of polymers and properties from literature reports.
Table containing polymer properties, monomers, and polymer structures that were curated from literature reports.Polymer structures were generated using the monomers-2-polymers code base.
Table provided in additional attachment SI_Table-of-polymer-properties.csv