Data-Driven Identification and Analysis of the Glass Transition in Polymer Melts

Understanding the nature of the glass transition, as well as precisely estimating the glass transition temperature of polymeric materials, remain open questions in both experimental and theoretical polymer science. We propose a data-driven approach that utilizes the high-resolution detail accessible through molecular dynamics simulations and considers the structural information of individual chains. It clearly identifies the glass transition temperature of polymer melts of weakly semiflexible chains. By combining principal component analysis and clustering, we identify the glass transition temperature in the asymptotic limit even from relatively short time trajectories, which just reach into the Rouse-like monomer displacement regime. We demonstrate that the fluctuations captured by the principal component analysis reflect the change in a chain's behavior: from conformational rearrangements above to small fluctuations below the glass transition temperature. Our approach is straightforward to apply and should be applicable to other polymeric glass-forming liquids.

ESPResSo++ [S5, S6]. For testing our data-driven approach, we take MD simulation trajectories at every 200τ in the time frame between 200τ and 3 × 10⁴τ (shaded gray region). We have also taken trajectories at every 0.2τ in the time frame between 0.2τ and 20τ (shaded blue region) for comparison. The Rouse-like scaling law g_1(t) ∼ t^{1/2} is represented by straight lines.

S-II. STATIC PROPERTIES OF POLYMER MELTS
The conformation of a polymer chain is usually described by the radius of gyration R_g and the end-to-end distance R_e, defined as

R_g² = (1/n_m) ∑_{i=1}^{n_m} (r_i − r_c.m.)²,  (S1)

R_e² = (r_{n_m} − r_1)²,  (S2)

where r_i is the coordinate of the ith monomer in the chain and r_c.m. is the center of mass of the chain. As mentioned above, the profiles of the probability distributions of R_g and R_e for all 2000 chains in the melt remain the same within fluctuations at all temperatures, as shown in Figure S2.

FIG. S3. Schematic representation of the two methods employed in the paper for the determination of the glass transition temperature T_g. The workflow uses intra-chain distances of individual chains from the melt. It can be applied in two independent ways: by projecting with PCA the combined data from all temperatures followed by clustering, where the change in cluster indices indicates T_g (Method I, upper row); or by applying PCA to each temperature separately and using changes in the leading eigenvalues or the participation ratio as the criterion to define T_g (Method II, lower row).
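For illustration, Eqs. (S1) and (S2) can be evaluated with a few lines of NumPy; this is a minimal sketch, not the analysis code used in this work:

```python
import numpy as np

def gyration_and_end_to_end(coords):
    """R_g and R_e for one chain; coords has shape (n_m, 3)."""
    com = coords.mean(axis=0)  # center of mass (equal monomer masses assumed)
    rg = np.sqrt(((coords - com) ** 2).sum(axis=1).mean())  # Eq. (S1)
    re = np.linalg.norm(coords[-1] - coords[0])             # Eq. (S2)
    return rg, re

# usage: a straight rod of 5 monomers spaced one unit apart
rod = np.zeros((5, 3))
rod[:, 0] = np.arange(5)
rg, re = gyration_and_end_to_end(rod)  # re = 4, rg = sqrt(2)
```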

S-IV. VARIANCE EXPLAINED RATIO OF PCA PROJECTIONS
The four leading PCs contain ≈ 60% of the variance of the data; this choice is based on the gap in the variance explained ratio (Figure S4).
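The selection of the leading PCs from the variance explained ratio can be sketched with scikit-learn. This is a toy example on random stand-in data; only the array shapes mirror our setup, so no gap appears here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 45))  # stand-in for M frames x L intra-chain distances

X_std = StandardScaler().fit_transform(X)  # standardise each distance
ratios = PCA().fit(X_std).explained_variance_ratio_  # sorted in decreasing order

# keep the leading PCs up to a visible gap in the ratios; for the real data
# this gap motivates keeping four PCs (~60% of the variance)
n_keep = 4
leading_fraction = ratios[:n_keep].sum()
```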

S-V. CLUSTERING METHOD - DBSCAN
The projected data is grouped using the unsupervised learning technique known as clustering analysis. Typically, grouping is done based on some common property, e.g. closeness in space, similarity in density (the same number of neighbours within a cut-off, separated by low-density regions), etc. The choice of clustering technique is then motivated by the properties on which the data should be grouped, by the properties of the data itself (e.g. the number of data points, the dimensionality, etc.), as well as by additional knowledge about the system. All clustering techniques require initial parameters to define clusters and then try to arrange the data points into groups. As a result, each data point is assigned an index corresponding to the group it belongs to, also called a cluster index (ID).
In our case, each data point in the projection corresponds to one configuration of a chain at a given time and temperature. Expecting that the configurations will lie much closer together at lower temperatures and be disordered at higher temperatures, we chose for clustering the density-based spatial clustering of applications with noise (DBSCAN) [S7]. The acronym was given by the authors of the clustering algorithm (Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, 1996); in our case it refers to distances and densities in the four-dimensional projected PCA space, where each chain is represented as a point in four dimensions. DBSCAN is designed to identify high-density regions in space as separate clusters and to assign all sparse points as unclustered, or "noise".
DBSCAN parameters. DBSCAN groups points that are close to each other based on a distance measure (usually the Euclidean distance²) and a minimum number of points within some cut-off.
There are two parameters one needs to specify for DBSCAN: 1) A cut-off value defining proximity in space, the minimum distance (ϵ_d): the radius within which points are regarded as neighbors. Two points are regarded as neighbors if the distance between them is less than or equal to this cut-off value ϵ_d. A small ϵ_d yields denser clusters; on the other hand, if ϵ_d is too high, the majority of the data ends up in the same cluster. There are strategies for choosing ϵ_d [S8].
2) The number of nearest neighbors (NN): the number of points within the distance ϵ_d needed to form the smallest cluster. As a rule of thumb, the minimum number of points should be greater than the dimensionality of the data and is typically chosen as twice the number of dimensions, but larger values may be necessary for some data sets.
DBSCAN has a number of advantages: it can find non-linearly separable clusters (clusters of any shape); the number of clusters is not an a priori parameter of the method (in contrast to many other methods); and, most importantly for us, it has a notion of noise.
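In scikit-learn, the two parameters above correspond to `eps` and `min_samples`. A small synthetic sketch (stand-in data, not our trajectories) of how a dense cluster and noise points are labelled:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# stand-in for 4-D PCA projections: one dense "glassy" blob plus sparse points
dense = rng.normal(0.0, 0.05, size=(200, 4))
sparse = rng.uniform(-3.0, 3.0, size=(20, 4))
X = np.vstack([dense, sparse])

# eps plays the role of eps_d, min_samples the role of NN
labels = DBSCAN(eps=0.3, min_samples=40).fit_predict(X)
# the dense blob becomes cluster 0; isolated points are labelled -1 ("noise")
```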
Evaluation of clustering results. To quantify the quality of the clustering results depending on the chosen parameters, we used the V-measure (or normalised mutual information) score [S9]. For this measure, reference cluster IDs for each point must be provided; those reference IDs are called true labels or ground truth. For our system, we define the reference states (ground truth from Ref. [S1], using the volume change) as cluster ID = -1 for the melt state and cluster IDs ≥ 0 for the glassy state. The V-measure combines two other measures: homogeneity and completeness.
Homogeneity is defined using Shannon's entropy.
It is given by

hom = 1 − H(C|K)/H(C),   H(C|K) = − ∑_{c=1}^{|C|} ∑_{k=1}^{|K|} (n_{ck}/M) log(n_{ck}/n_k),

where C is the target clustering (ground truth), K the predicted clustering, M the size of the data set, n_{ck} the number of samples with cluster ID c in cluster k, and n_k the total number of samples in cluster k. The homogeneity is equal to 1 when every sample in each cluster k has the same cluster ID c; 0 ≤ hom ≤ 1, with low values indicating low homogeneity.

² For points p and q with Cartesian coordinates p_i and q_i in n-dimensional Euclidean space, the distance is d(p, q) = √(∑_{i=1}^{n} (p_i − q_i)²).
Completeness measures whether all similar points are assigned to the same cluster; it is given by

comp = 1 − H(K|C)/H(K).

The completeness is equal to 1 when all samples with cluster ID c have been assigned to the same cluster k; 0 ≤ comp ≤ 1. The normalised mutual information, or V-measure, quantifies the goodness of a clustering algorithm as the harmonic average of homogeneity and completeness,

V = 2 · (hom · comp)/(hom + comp),

where high values indicate good clustering.
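These scores are available in scikit-learn. A toy example (hypothetical labels, using -1 for the melt state as in our convention) shows how splitting the glassy state into two clusters keeps the homogeneity at 1 but lowers the completeness:

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

truth = [-1, -1, -1, -1, 0, 0, 0, 0]   # ground truth: melt (-1) vs glass (0)
labels = [-1, -1, -1, -1, 0, 0, 1, 1]  # DBSCAN split the glass into two clusters

hom = homogeneity_score(truth, labels)    # 1.0: every cluster is pure
comp = completeness_score(truth, labels)  # < 1: glassy samples are split
v = v_measure_score(truth, labels)        # harmonic mean of hom and comp
```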
The clustering score was computed using the Python scikit-learn package [S10]. As shown in Figure S5, good clustering scores are obtained for quite a wide range of NN and ϵ_d values on the PCA embedding. For the data shown in the main text we used the Euclidean distance (ϵ_d = 0.3) to create a neighbourhood and a minimum number of points (NN = 40) to form a dense region.

FIG. S6. (a) If we relabel the clusters (blue) so that there is only one cluster in the glassy state, the error bars (blue bars) in the glassy state immediately drop to zero and are maximal near the transition. (b) Averaging using the median instead of the mean for the DBSCAN cluster indices (IDs). The transition at T_g becomes even sharper with this type of averaging.
Defining the average cluster ID ⟨n(T)⟩. We perform DBSCAN clustering on the four-dimensional PCA projections for each chain separately. DBSCAN identifies the high-temperature states as sparse, or "noise" (assigning them cluster ID = -1), and the low-temperature glassy state as one or more clusters (cluster IDs ≥ 0). The change in cluster indices after T = 0.65ϵ/k_B is prominent (Figure 1d). We repeat this clustering for every chain in the system (2000 chains), so that each chain at each frame and temperature receives a cluster ID (in our example the cluster IDs in PCA space are -1, 0, 1 or 2). We observe that for some of the projections in the glassy state, more than one group/cluster is found, corresponding to cluster IDs 0, 1 and 2. We then average the cluster IDs over all chains, ⟨n(T)⟩, at each temperature. If we use the mean for averaging, the average cluster IDs are higher than zero, reflecting that more than one cluster was found at that temperature. This deviation in cluster IDs shows up as large constant error bars in Figure S6a (black line). If we assume that there can be only one cluster in the glassy state and merge all cluster IDs 0, 1 and 2, the error bars in the glassy state drop to zero and are maximal near the transition region, where assigning points to a group is most challenging (Figure S6a, blue line). If we use the median for averaging, we see no such deviations from zero (Figure S6b).
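The effect of mean versus median averaging of the cluster IDs can be seen in a small numerical sketch; the ID counts below are hypothetical, chosen only to mimic a glassy temperature where DBSCAN splits off extra clusters:

```python
import numpy as np

# hypothetical per-chain cluster IDs at one glassy temperature: most chains in
# cluster 0, a few split off by DBSCAN into clusters 1 and 2
ids = np.array([0] * 1900 + [1] * 60 + [2] * 40)

mean_id = ids.mean()        # > 0: pulled upward by the extra clusters
median_id = np.median(ids)  # 0: robust against the few split-off chains

# merging all glassy clusters (IDs >= 0) into one removes the bias as well
merged = np.where(ids >= 0, 0, -1)
```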

S-VI. PCA ON COMBINED CHAINS
In the main text, we perform PCA on the intra-chain distances of an individual chain over the simulation frames, followed by averaging over all chains present in the system.
Here we perform PCA on all 2000 chains together. We use the input data matrix X ∈ R^{M×L}, where L is the number of descriptors (e.g. the intra-chain distances of a single chain of length n_m = 50: L = n_m(n_m − 1)/2 = 1225, the same as before), and M is the number of observations (the number of chains multiplied by the number of temperatures, 2000 × 20 = 40000). We observe a Gaussian-like distribution for all temperatures in Figure S7, within fluctuations. This can also be seen in Figure S8a,b, where the intra-chain distance distributions have no peaks at higher values. We stress once more that, due to the standardisation of the distances, PCA accounts for relative changes rather than absolute displacement values, which suggests that rearrangements at long, medium and short ranges have equal importance. Figure S8d shows the averaged Pearson correlation coefficients between the projections onto the leading PCs, the radius of gyration R_g (Eq. (S1)) and the end-to-end distance R_e (Eq. (S2)).
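Such Pearson coefficients can be computed, for instance, with `numpy.corrcoef`; a sketch on synthetic stand-in data (the coupling strength below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
pc1 = rng.normal(size=2000)                        # projections onto PC1
rg = 0.8 * pc1 + rng.normal(scale=0.3, size=2000)  # hypothetical correlated R_g

r = np.corrcoef(pc1, rg)[0, 1]  # Pearson correlation coefficient, in [-1, 1]
```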
Results are obtained by averaging over all 2000 chains. We find that both R_g and R_e are correlated with the first PC.

FIG. S8. Distributions of the intra-chain distances most strongly correlated (correlation higher than 0.9) with the projections onto the first (a) and second (b) principal components for all chains, compared with the distributions of all intra-chain distances (blue bars). (c) Distribution of the respective chemical distances of the intra-chain distances highly correlated with the projections onto the two leading PCs; no preference for longer/shorter distance ranges is observed. (d) Pearson correlation coefficients between the projections onto the leading PCs, the radius of gyration (R_g) and the end-to-end distance (R_e).

FIG. S9. The color gradient for all projections corresponds to the simulation time, from light gray to black. The fluctuations of the data have the same magnitude after we standardise the input distances at each T; hence the PCA projections at T ≫ T_g and T < T_g look visually similar. Around T_g we see a change in the shape of the projections, quantified using the first eigenvalue and the participation ratio as discussed in the main text. Data taken from the time window between 200τ and 3 × 10⁴τ (gray area in Figure S1).

FIG. S10. Temperature-dependent PCA projections for a randomly selected chain at individual T from short-time MD data (up to 20τ) (blue area in Figure S1). No sharp jump in the first eigenvalue or PR around T_g ≈ 0.65ϵ/k_B is observed.

FIG. S11. (a,b) As in Figure 1d. The first eigenvalue (c) and PR (d) from the individual-temperature analysis (compare to Figure 3).

S-VIII. PROJECTIONS OF THE REDUCED DESCRIPTOR SPACE
In the main text, we use the input data matrix X_c ∈ R^{M×L}, where L is the number of descriptors (the intra-chain monomer-monomer distances of a single chain) and M is the number of observations (M = 3000 for Method I and M = 150 for Method II). To reduce the short-range and highly correlated features, we skip ∆_m consecutive monomers, such that only ⌊n_m/(∆_m + 1)⌋ monomers with indices i ∈ {k(∆_m + 1) + 1 : k = 0, 1, ..., ⌊n_m/(∆_m + 1)⌋ − 1} are selected from the n_m monomers of each chain. E.g. for n_m = 50 and ∆_m = 4 the monomers with indices i ∈ {1, 6, ..., 41, 46} are selected. The new descriptor space (L = 10 × (10 − 1)/2 = 45, M unchanged) is relatively low-dimensional compared to the original (L = 1225). In Figure S11, we plot the single-chain PCA projection, the cluster IDs averaged over all chains, and the first eigenvalue and PR from the individual-temperature analysis (as described in the main text) versus the temperature T. All results are similar in nature after reducing the input feature space by excluding contributions from monomer pairs with chemical distances less than (∆_m + 1) along the same chain (data taken from the gray area of Figure S1). Nonetheless, to avoid the discussion of how many monomers one can skip for each specific system, and to keep the description of our method general, we present the data with all intra-chain distances in the main text, as PCA accounts for the highly correlated distances. It is important to note the distinction between the eigenvector spaces used for Methods I and II: in the first case we have one set v_{c,k}, k = 1, ..., P, over all temperatures, while for Method II there is an independent set of eigenvectors v_{c,k}(T) for each T. Examples of the first eigenvectors for four different chains are shown in Figure S12 for (a) Method I (combined temperatures) and (b-i) Method II (individual temperatures).
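The skipping rule can be written out explicitly; a short sketch (1-based monomer indices, as in the text):

```python
def selected_monomers(n_m, delta_m):
    """Indices i = k*(delta_m + 1) + 1 for k = 0, ..., floor(n_m/(delta_m+1)) - 1."""
    n_sel = n_m // (delta_m + 1)
    return [k * (delta_m + 1) + 1 for k in range(n_sel)]

idx = selected_monomers(50, 4)        # [1, 6, 11, ..., 41, 46]
n_sel = len(idx)                      # 10 monomers remain
L_reduced = n_sel * (n_sel - 1) // 2  # 45 descriptors, as quoted in the text
```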
For the combined temperatures, as well as for the lower individual temperatures, there is no common pattern or peak across different chains. However, for the high temperatures (Figure S12b-e) we observe a similarity in the first eigenvectors, which suggests the importance of chemical distances around 30 for the PCA projections obtained at these temperatures. The same importance was found for the PCA projections of Method I (see Figure S8c) and requires further study.