Selecting High-Dimensional Representations of Physical Systems by Reweighted Diffusion Maps

Constructing reduced representations of high-dimensional systems is a fundamental problem in physical chemistry. Many unsupervised machine learning methods can automatically find such low-dimensional representations. However, an often overlooked problem is what high-dimensional representation should be used to describe systems before dimensionality reduction. Here, we address this issue using a recently developed method called the reweighted diffusion map [J. Chem. Theory Comput.2022, 18, 7179–7192]. We show how high-dimensional representations can be quantitatively selected by exploring the spectral decomposition of Markov transition matrices built from data obtained from standard or enhanced sampling atomistic simulations. We demonstrate the performance of the method in several high-dimensional examples.

7192].We show how high-dimensional representations can be quantitatively selected by exploring the spectral decomposition of Markov transition matrices built from data obtained from standard or enhanced sampling atomistic simulations.We demonstrate the performance of the method in several high-dimensional examples.Machine learning is becoming widely used for performing dimensionality reduction in atomistic simulations and general analysis of high-dimensional systems in physical chemistry.

TOC Graphic
][3][4][5][6][7][8][9][10][11][12] The quality of variables resulting from such methods depends heavily on the input data consisting of many configuration variables referred to as features.However, the selection of such high-dimensional representations used subsequently for dimensionality reduction is often overlooked or performed by trial and error.
A possible basis for the interpretable construction of high-dimensional representations in complex systems is the preservation of their timescale separation between slow and fast variables.The slow variables are intrinsically related to the kinetics of rare transitions between long-lived metastable states in configuration space, 13,14 which are essential in many processes, for instance, catalysis, 15 crystallization, 16 or conformational transitions. 17,18The fast variables, however, are adiabatically slaved to the dynamics of the slow variables and correspond mainly to equilibration within metastable states.Therefore, we can consider different representations of the same system equivalent if the same timescale separation characterizes them.
In this Letter, we exploit this idea and develop a method for a quantitative and interpretable selection of the high-dimensional configuration space based on the spectral decomposition of Markov transition matrices from simulation data.0][21][22][23][24][25][26][27] In contrast to such approaches, our method does not use eigenfunctions to parametrize the slow coarse-grained variables.Instead, it iteratively removes variables from the complete high-dimensional representation to find a partial selection of configuration variables while preserving kinetic information about the complete high-dimensional representation.
Let us first introduce the concept of the high-dimensional representation of physical sys-tems.Each microscopic configuration of the system is described by n configuration variables (i.e., features) x = (x 1 , . . ., x n ).In the case of the microscopic coordinates, the configuration variables are sampled from an equilibrium probability distribution given by the Boltzmann density p(x) ∝ e −βU (x) , where U (x) is the potential energy of the system, and is the inverse temperature.However, the equilibrium probability distribution is usually unknown for other high-dimensional spaces (e.g., invariant representations).
To estimate the kinetic information encoded in the high-dimensional space, we collect N samples of n configuration variables from a simulation to construct the Markov transition matrix and perform its spectral decomposition.To this aim, a data set consisting of these samples is: where the samples are augmented by statistical weights w if we sample a biased probability distribution, such as in enhanced sampling simulations.The weights are given as follows: where p(x k ) and q(x k ) are the unbiased and biased probability distributions at the k-th sample from X, respectively.For unbiased simulations, the weights are reduced to unity as they are sampled from the equilibrium distribution.
Next, let us introduce reweighted diffusion map. 12We start by constructing an auxiliary Gaussian kernel which encodes information about the local geometry of the configuration space, , where x k and x l are n-dimensional samples from the data set (eq 1) and ε is a scale constant which is chosen depending on the data set usually selected so that it matches the distance between neighboring samples.Here, we calculate the scale constant as the median of Euclidean distances ∥•∥ for simplicity; see the Supplementary Information for details.Other techniques can also be used when further adjustment of the scale constant is required [28][29][30] .
Then, a reweighted anisotropic kernel is introduced to employ additional information about the density and importance of the configuration space: where ϱ(x) = k g ε (x, x k ) is up to a multiplicative constant a kernel density estimate.In eq 3, we reweight the anisotropic kernel using r(x k , x l ), which is introduced to correct the effect of sampling from the biased probability distribution q (i.e., diffusion reweighting 12 ).
The reweighting factor is given as: 10,12 where w(x k ) and w(x l ) are the statistical weights corresponding to the k-th and l-th samples, respectively.See ref 12 for a derivation of diffusion reweighting and a more general discussion.
For unbiased simulations, eq 4 reduces to the anisotropic diffusion kernel used in diffusion map. 23,31uation 3 asymptotically corresponds to a reversible, overdamped approximation to the slow dynamics with the unbiased probability p(x) as the stationary density 23,31 even if the underlying dynamics proceeds according to the biased probability distribution. 12As such, the reweighted diffusion kernel given in eq 3 is a reasonable approximation for our method.
Having calculated the reweighted anisotropic diffusion kernel (eq 3), we can finally compute the related row-normalized kernel to define the Markov transition matrix M (x k , x l ) with the corresponding transition probabilities m kl : which is equivalent to an unbiased Markov chain given by Pr{x t+1 = x l | x t = x k } that defines a probability that the system transitions from x k to x l in one timestep t.Note that the time in the Markov chain is auxiliary and should not be confused with the simulation timestep or a timelag.Regardless of whether the data set is sampled from the biased probability distribution, the Markov transition matrix corresponds to the unbiased Markov chain after the reweighting. 12weighted diffusion map is very related to target measure diffusion map introduced by Banisch et al. 32 The main difference netween both methods lies in their formulation.Here, we employ general statistical weights from an enhanced sampling simulation (eq 2), and target measure diffusion map uses target probability distributions.Both techniques can be used in our framework.However, working with statistical weights is more suitable for our purposes as, usually, even approximate target measures are unknown.
Here, we use the Markov transition matrix M to determine if the configuration space spanned by a partial selection of the configuration variables contains similar kinetic information as the high-dimensional representation of the system spanned by all its configuration variables.For this, we first perform an eigendecomposition of the Markov transition matrix constructed from the complete set of the configuration variables M ψ = λψ and calculate its eigenvalues {λ k } and eigenfunctions {ψ k }.The eigenvalues are sorted by decreasing values.The corresponding eigenfunctions contain kinetic information about the system as the eigenvalues are related to the intrinsical timescales of the system.
Next, the eigendecomposition is carried out for combinations of the configuration variables, which define a data set X d (d is the number of the configuration variables in the partial representation) and compare it to the eigenvalues of the complete high-dimensional representation.To describe how much kinetic information is conserved in the partial representations, we define a spectral loss: where α is a normalization constant so that the value of the spectral loss for one variable is equal to 1, and λ k and λ dk are the eigenvalues of the complete high-dimensional representation and the partial representation consisting of d configuration variables, respectively.
Therefore, a combination of the configuration variables preserves kinetic information encoded in the complete representation if the spectral loss is close to zero.
Given the data set X of n configuration variables x = (x 1 , . . ., x n ) and its spectral decomposition {λ k , ψ k } of the related Markov transition matrix M , we search for the partial high-dimensional data set X d of d configuration variables that upon spectral decomposition of its Markov transition matrix into {λ dk , ψ dk } contains similar kinetic information as the Markov transition matrix calculated from X.To avoid an exhaustive and computationally demanding search through all combinations of the configuration variables, we use an algorithm that provides a suboptimal result. 33Our algorithm is summarized below: (1) Start from the complete high-dimensional representation.
(2) Repeat until the number of selected configuration variables is d: (a) Remove from the data a variable corresponding to the minimal spectral loss upon removing one variable.
(b) Add a configuration variable to the data if the spectral loss upon addition back decreases.
(c) Go to (a) if adding any configuration variable does not result in decreasing the spectral loss; else, go to (b).
As an initial test for our method, we apply it to four high-dimensional benchmark data sets of dimension n = 10 sampled from multivariate Gaussian distributions that create clusters of samples on vertices of an m-informative hypercube with interdependence between these variables and additional noise.They consist of a different number of informative variables (m = 4, 6, 8, 10).The remaining variables in these data sets are combinations of the informative variables.Further details about these data sets are available in Supplementary Information.We present our results in Figure 1.By computing the spectral loss in reference to the eigendecomposition of the Markov transition matrix from the data set X, we can see that when the number of informative variables (m) is lower in the partial data sets, the data resembles X much faster in comparison to the complete data set for m = 10; see Figure 1(a).
This observation indicates that the method correctly identifies the number of informative variables, as the improvement of the spectral representation of the high-dimensional space upon adding the remaining variables is negligible.
This observation is also supported by checking how the eigenfunctions ψ d0 of the partial high-dimensional representations converge to the eigenvalues calculated based on the Markov transition matrix from the data set X; see Figure 1(b).For example, the eigenfunctions ψ n0 for the m = 4 informative variables resemble ψ 0 much faster.In contrast, for m = 10 informative variables, all the variables must be included in the data set X d to match the eigenvalues and eigenfunctions calculated from the Markov transition matrix from X.
As a subsequent example, we consider alanine dipeptide in vacuum, which has two main long-lived metastable states that its Φ dihedral angle can separate; see Figure 2  1-µs well-tempered metadynamics 37 simulation at 340 K with a bias factor of 5 and select samples every 8 ps from the last 20 ns of the simulation, amounting to the data set consisting of N = 2500 samples.As biased variables to enhance transitions between the folded and unfolded states of chignolin, we choose the distance between Cα atoms of residues Y1 and Y10 and the radius of gyration.The statistical weights are calculated using the Tiwary-Parrinello reweighting algorithm. 38From the resulting trajectory, we calculate the sines and cosines of the backbone Φ and Ψ dihedral angles of chignolin and use them as the complete high-dimensional representation (n = 36 variables in total).
We present the results in Figure 3.The spectral loss calculated for every partial selection of the configuration variables is shown in Figure 3(a).We can see that the spectral loss gradually decreases, reaching similarity with the complete selection for d > 15.This fact can also be observed in Figure 3 showing that the method identifies the variables that can carry the information about the folding and unfolding of chignolin while neglecting the possibly spurious variables.A list of the selected variables can be seen in Table S2 in Supplementary Information.
In this Letter, we present a simple and practical method for selecting high-dimensional representations of complex physical systems sampled by standard and enhanced sampling atomistic simulations.The presented technique is general and requires only simulation data with statistical weights if the data is generated using an enhanced sampling technique.We use well-tempered metadynamics here for enhanced sampling and the Tiwary-Parrinello reweighting to generate data.However, data can be obtained from, for instance, parallel tempering, which does not require defining variables to bias before the simulation and is easy to reweight.In general, the presented algorithm can be tailored to any enhanced sampling technique; however, a reweighting scheme should be chosen as appropriate for the used method.
The method constructs unbiased Markov transition matrices from simulation data and calculates their spectral decomposition for combinations of the configuration variables comprising the complete high-dimensional representation of the system.The selection algorithm creates a high-dimensional representation by iteratively removing the configuration variables from the data so that the spectral decomposition performed on the partial selection of the configuration variables resembles that of the complete representation.This kinetic equivalence is obtained by minimizing the spectral loss that measures the deviation between the eigenvalues obtained from the considered high-dimensional representations.Since the selected partial high-dimensional representation is interpretable and preserves the timescale separation of the complete representation of the system, it can be used as an initial representation for subsequent construction of the slow variables, ensuring that the high-dimensional representation includes all relevant information about the system dynamics.When the timescale separation is ensured in a partial high-dimensional representation, further dimensionality reduction has the essential information about the dynamics of the studied complex physical system.In conclusion, our method can become a helpful approach to analyzing high-dimensional physical systems and has the potential to be further explored.

Figure 1 :
Figure 1: High-dimensional data sets (n = 10) sampled from multivariate Gaussian distributions with a different number of informative variables (m).For additional information about the benchmark data sets, see Supplementary Information.(a) Spectral loss σ d shown for the data sets consisting of m = 4, 6, 8, 10 informative variables as a function of the number of the variables selected as the partial representation of the data set d. Spectral loss is rescaled so that its value for one variable and m = 10 is equal to one.(b) Similarity between the equilibrium distributions calculated as the zeroth eigenfunctions for two selected data sets (m = 4 and m = 10).The eigenfunctions ψ d0 are compared to ψ 0 , i.e., the eigenfunctions of the Markov transition matrix M computed from the complete data set X.

Figure 2 :
Figure 2: Selecting high-dimensional configuration space for alanine dipeptide (Ace-Ala-Nme) in vacuum.(a) Structure of alanine dipeptide with the Φ and Ψ dihedral angles marked.The system is represented by distances between heavy atoms (n = 45) from the data set of N = 2500 samples generated using well-tempered metadynamics biasing Φ and Ψ with a bias factor of 5.For additional information about the simulation, see Supplementary Information.(b) Equilibrium density calculated as the zeroth eigenvector of the Markov transition matrices using reweighted diffusion map in the space spanned by the Φ and Ψ dihedral angles.(c) Spectral loss σ d showing the convergence of the partial selection of the configuration variables to the reference data set at about d ∼ 10 (the number of the configuration variables in the partial representations d is shown on a logarithmic scale).Spectral loss is rescaled so that its value for one variable is equal to one.(d) Eigenvalues {λ dk } (k is the index of the eigenvalue) calculated from the partial data sets X d (d = 1, 2, 3, 10) shown in blue and X shown in magenta.The last shown spectrum for d = 10 is the same as for the complete data set X.
(a).As a high-dimensional representation, we select all heavy-atom distances (the number of variables n = 45) monitored during a 100-ns long simulation.For the data set, we use the last 10 ns of the simulation, sampled every 4 ps.The total number of samples used to construct the Markov transition matrix is N = 2500.The simulation is performed using gromacs

Figure 3 :
Figure 3: Selecting high-dimensional configuration space for chignolin in solvent.The system is represented by the sines and cosines of the Φ and Ψ dihedral angles (n = 36).The data set consists of N = 2500 samples generated using well-tempered metadynamics biasing the distance between the Cα atoms of the Y1 and Y10 residues and the radius of gyration with a bias factor of 5 at 340 K.For details, we refer to Supplementary Information.(a) Spectral loss σ d for the partial high-dimensional representations of X d (the number of the configuration variables in the partial representations d is shown on a logarithmic scale).Spectral loss is rescaled so that its value for one variable is equal to one.(d) Eigenvalues {λ dk } (k is the index of the eigenvalue) calculated for the partial representations consisting of d = 1, 2, 3, 16 variables.The eigenvalues of the Markov transition matrix from the complete set (magenta) are identical to the representation of d = 16 variables (blue).(c) Eigenfunctions ψ 1 and ψ 2 colored as a function of the equilibrium eigenfunction ψ 0 (colorbar) shown for the partial selection d = 16 and the complete data set d = 36.The eigenfunctions are calculated from the d-dimensional representation.(d) Distributions of the ψ 0 and ψ 1 eigenfunctions obtained based on the partial (d = 16) and complete selection (d = 36) of the configuration variables.
(b), where the eigenvalues of the Markov transition matrix calculated for d = 16 are converged in reference to the eigenvalues computed from the complete data set.Moreover, the eigenfunctions ψ 1 and ψ 2 shown as a function of the equilibrium eigenfunction ψ 0 calculated based on the selections of the configuration variables for d = 16 and d = 36 are very similar; see Figure 3(c).Additionally, in Figure 3(d), we show that the distributions of the ψ 0 and ψ 1 eigenfunctions for the partial selection of the configuration variables (d = 16) and the complete representation (d = 32) are also in agreement.Our calculations show that less than half of the configuration variables in the complete high-dimensional representation is needed to represent the system.Interestingly, the configuration variables selected for d = 16 correspond to every residue of chignolin,