Reweighted Manifold Learning of Collective Variables from Enhanced Sampling Simulations

Enhanced sampling methods are indispensable in computational chemistry and physics, where atomistic simulations cannot exhaustively sample the high-dimensional configuration space of dynamical systems due to the sampling problem. A class of such enhanced sampling methods works by identifying a few slow degrees of freedom, termed collective variables (CVs), and enhancing the sampling along these CVs. Selecting CVs to analyze and drive the sampling is not trivial and often relies on chemical intuition. Although this issue is routinely circumvented by using manifold learning to estimate CVs directly from standard simulations, such methods cannot provide mappings to a low-dimensional manifold from enhanced sampling simulations, as the geometry and density of the learned manifold are biased. Here, we address this crucial issue and provide a general reweighting framework based on anisotropic diffusion maps for manifold learning that takes into account that the learning data set is sampled from a biased probability distribution. We consider manifold learning methods based on constructing a Markov chain describing transition probabilities between high-dimensional samples. We show that our framework reverts the biasing effect, yielding CVs that correctly describe the equilibrium density. This advancement enables the construction of low-dimensional CVs using manifold learning directly from the data generated by enhanced sampling simulations. We call our framework reweighted manifold learning. We show that it can be used in many manifold learning techniques on data from both standard and enhanced sampling simulations.


I. INTRODUCTION
Among the main challenges in atomistic simulations of physical systems is the significant temporal disparity between the timescales explored in standard atomistic simulations and the long timescales observed in experiments. Atomistic simulations can only reach timescales of up to milliseconds and thus cannot exhaustively sample the high-dimensional phase space, leading to the so-called sampling problem that has both theoretical and computational consequences for dynamical systems. The reason for the sampling problem is that these systems are characterized by many metastable states (i.e., high-probability regions) separated by energy barriers (i.e., low-probability regions) much higher than the thermal energy (k_B T). This leads to the kinetic entrapment of the system in a single metastable state, as on the timescales obtained in standard atomistic simulations, transitions to other metastable states are infrequent events. Such transitions between metastable states can be related to a few slow degrees of freedom that define a low-dimensional energy landscape. Examples of physical processes exhibiting metastability are phase transitions in crystals [1] and glass transitions in amorphous materials [2].

* jr@fizyka.umk.pl
A possible resolution to the sampling problem is given by enhanced sampling methods [3-7]. Over the years, various strategies for enhanced sampling have emerged, e.g., tempering, variational, or biasing approaches; see Ref. [7] for a classification and references therein. In this article, we consider a class of such enhanced sampling methods based on the work by Torrie and Valleau [8], which devised a framework for enhanced sampling that modifies the Boltzmann probability distribution by introducing a bias potential acting in a low-dimensional space of collective variables (CVs) that correspond to slow degrees of freedom. However, identifying the reduced space of these CVs capturing the underlying physical processes must be done before enhanced sampling simulations; it is far from trivial and often relies on experience and physical intuition. Consequently, many data-driven approaches are used to perform dimensionality reduction and construct CVs using samples directly from exploratory trajectories [9-18].
An example of such data-driven approaches is manifold learning [19]. The core of most manifold learning methods is having a notion of similarity between high-dimensional data samples, usually through a distance metric [20-22]. The distances are integrated into a global parametrization of the data using kernels to represent a Markov chain containing information about transition probabilities that can be used to learn a smooth and low-dimensional manifold that captures the essentials of the data. This way, we can employ dimensionality reduction methods to learn CVs corresponding to slow degrees of freedom. We can distinguish two main approaches that manifold learning methods take to obtain a mapping to a low-dimensional representation of data: (i) eigendecomposition [23-31] and (ii) divergence optimization [21,22,32].

arXiv:2207.14554v1 [physics.chem-ph] 29 Jul 2022
When using manifold learning on dynamical data resulting from atomistic simulations, these data must contain statistically sufficient information about the sampled physical process. If a high-dimensional data set used in manifold learning does not capture the rare transitions between metastable states, the learned low-dimensional CVs will not either. Unbiased atomistic simulations by construction sample only a fraction of the available configuration space and generally capture fast equilibrium processes. Therefore, employing such unbiased simulations as learning data sets for manifold learning methods can lead to undersampled and non-optimal CVs that do not capture the slow degrees of freedom corresponding to the rare physical processes.
We can circumvent this issue by using learning data sets from enhanced sampling simulations, where transitions between metastable states are more frequently observed and are no longer rare events. However, in this case, the simulation data set is biased and does not correspond to the real physical system, as it is sampled from a biased probability distribution. Using these biased simulation data directly in manifold learning algorithms renders low-dimensional manifolds that are also biased (i.e., in their geometry, density, and importance) and thus CVs that do not correspond to the physical process. Therefore, in manifold learning, we need to correctly take into account that we use biased simulation data when learning CVs from enhanced sampling simulations. Despite several attempts in this direction [14,15,33-37], this area remains largely unexplored.
In this work, we consider the problem of using manifold learning methods on data from enhanced sampling simulations. We provide a unified framework for manifold learning to construct CVs using biased simulation data, which we call reweighted manifold learning. To this aim, we derive a pairwise reweighting procedure inspired by anisotropic diffusion maps, which accounts for sampling from a biased probability distribution. We term this procedure diffusion reweighting. Our framework considers the underlying geometry, density, and importance of the simulation data to construct a low-dimensional manifold for CVs encoding the most informative characteristics of the high-dimensional dynamics of the atomistic system.
Our general framework can be used in many manifold learning techniques on data from both standard and enhanced sampling atomistic simulations. We show that our diffusion reweighting procedure can be employed in manifold learning methods that use either eigendecomposition or divergence optimization. We demonstrate the validity and relevance of our framework on both a simple model potential and high-dimensional atomistic systems.

II. THEORY
In this section, we introduce the theory behind CVs (Sec. II A), enhanced sampling (Sec. II B), reweighting (Sec. II C), and biased data (Sec. II D) that we need to derive diffusion reweighting (Sec. II E).

A. Collective Variables
In statistical physics, we consider an n-dimensional system specified in complete detail by its configuration variables x ∈ R^n. These configuration variables indicate the microscopic coordinates of the system or any other variables (i.e., functions of the microscopic coordinates) relevant to the studied process, e.g., an invariant representation. As a result, such a statistical representation is generally of high dimensionality.
In general, the configuration variables x are sampled during a simulation according to some, possibly unknown, high-dimensional probability distribution P(x) that has a corresponding energy landscape U(x) given by the negative logarithm of the probability distribution and an appropriate energy scale. If x consists of the microscopic coordinates, this distribution is known and is the stationary Boltzmann distribution:

P(x) = e^{-βU(x)} / Z,  (1)

where U(x) is the potential energy function of the system, the canonical partition function is Z = ∫ dx e^{-βU(x)}, and β^{-1} = k_B T is the thermal energy with T and k_B denoting the temperature and Boltzmann's constant, respectively. Without loss of generality, we limit the discussion to the canonical ensemble (NVT) here.
The high-dimensional description of the system is very demanding to work with directly; hence, many classical approaches in statistical physics were proposed to introduce a coarse-grained representation, e.g., the Mori-Zwanzig formalism [38,39] or Koopman's theory [40].
To reduce the dimensionality of the high-dimensional space and obtain a more useful representation with a lower number of degrees of freedom, we map the configuration variables to a limited number of functions of the configuration variables, or so-called CVs. The corresponding target mapping ξ is the following:

z = ξ(x) = (ξ_1(x), ..., ξ_d(x)),  (2)

where d is the number of CVs (d ≪ n) and {ξ_k} are CVs.
The parametrization of the target mapping is performed to retain the system characteristics after embedding into the low-dimensional CV space (Fig. 1). In contrast to the configuration variables x, there are several requirements that the optimal CVs should fulfill: (i) they should be few in number (i.e., the CV space should be low-dimensional), (ii) they should correspond to slow modes of the system, and (iii) they should separate relevant metastable states. If these requirements are met, we can quantitatively describe rare events.
Let us assume the target mapping and the CVs are known. Then, we can calculate the equilibrium marginal distribution of CVs by integrating over the remaining variables:

P(z) = ∫ dx δ(ξ(x) − z) P(x),  (3)

where δ denotes the Dirac delta function. Having the marginal equilibrium probability, we can define the free-energy landscape in the CV space as the negative logarithm multiplied by the thermal energy:

F(z) = −(1/β) ln P(z).  (4)

In practice, free-energy landscapes for systems severely affected by the sampling problem are characterized by many metastable states separated by high kinetic barriers that impede transitions between metastable states. Consequently, on the timescales we can simulate, the system stays kinetically trapped in a single free-energy minimum and cannot explore the CV space efficiently.
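As a sketch of how the free-energy landscape in Eq. (4) is estimated in practice from sampled CV values, the following Python snippet histograms synthetic one-dimensional CV samples and takes the negative logarithm; all data and parameters here are illustrative, not from the text:

```python
import numpy as np

# Toy CV samples drawn from a known equilibrium distribution (illustrative).
beta = 1.0  # inverse thermal energy 1/(k_B T)
rng = np.random.default_rng(0)
z_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Estimate P(z) by histogramming, then F(z) = -(1/beta) ln P(z), Eq. (4).
hist, edges = np.histogram(z_samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mask = hist > 0                      # avoid log(0) in empty bins
F = -np.log(hist[mask]) / beta
F -= F.min()                         # F is defined up to an additive constant
```

For Gaussian samples this recovers the expected quadratic landscape F(z) ≈ z²/2 near the minimum.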

B. Enhanced Sampling
CV-based enhanced sampling techniques overcome the sampling problem by introducing a bias potential V(z) acting in the CV space designed to enhance CV fluctuations. The functional form of the bias depends on the enhanced sampling method used [4,7,8,41-43]. The bias potential can be static [8] or adaptively constructed on the fly during the simulation [4,7,41-43]. Regardless of how the bias potential is constructed, it leads to a biased CV distribution that is smoother and easier to sample than the unbiased distribution [Eq. (3)]:

P_V(z) = e^{−β(F(z)+V(z))} / Z_V,  (5)

where ⟨·⟩_V denotes the biased ensemble average and the biased partition function is Z_V = ∫ dz e^{−β(F(z)+V(z))}. CV-based enhanced sampling methods construct the bias potential to reduce or entirely flatten free-energy barriers. Let us consider well-tempered metadynamics [42], which is the method we employ in this work. Well-tempered metadynamics uses a history-dependent bias potential updated iteratively by periodically depositing Gaussians centered at the current location in the CV space. The bias potential is given as:

V(z) = Σ_l e^{−βV(z_l)/(γ−1)} G_σ(z, z_l),  (6)

where G_σ(z, z_l) is a Gaussian kernel with a bandwidth set σ, z_l is the center of the l-th added Gaussian, and γ is a bias factor that determines how much we enhance CV fluctuations. Well-tempered metadynamics converges to a biased CV distribution given by the so-called well-tempered distribution:

P_V(z) ∝ [P(z)]^{1/γ} ∝ e^{−βF(z)/γ},  (7)

which we can view as sampling an effective free-energy landscape F/γ with barriers reduced by a factor of γ.
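A minimal sketch of the well-tempered bias deposition described above: Gaussians are added whose heights are scaled down by the current bias at their centers, so the update tempers over time. The grid, bandwidth, initial height, and bias factor are illustrative assumptions, not values from the text:

```python
import numpy as np

# Well-tempered metadynamics bias on a 1D CV grid (illustrative parameters).
kBT = 1.0
gamma = 10.0       # bias factor
h = 1.0            # initial Gaussian height
sigma = 0.2        # Gaussian bandwidth

grid = np.linspace(-2.0, 2.0, 201)
V = np.zeros_like(grid)

def deposit(V, z_l):
    """Add one Gaussian whose height is tempered by the current bias at z_l."""
    height = h * np.exp(-np.interp(z_l, grid, V) / ((gamma - 1.0) * kBT))
    return V + height * np.exp(-(grid - z_l) ** 2 / (2.0 * sigma**2))

# Repeatedly deposit at the same location: the heights decay, so V converges.
for _ in range(100):
    V = deposit(V, z_l=0.0)
```

Because each new hill is damped by exp(−V/((γ−1)k_BT)), the accumulated bias saturates instead of growing linearly with the number of depositions.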

C. Reweighting
Biasing results in a gradual divergence from the equilibrium CV distribution to a smoother and easier-to-sample biased CV distribution, i.e., from Eq. (3) to Eq. (7) in the case of well-tempered metadynamics. Consequently, the importance of each sample is given by a statistical weight needed to account for the effect of the bias potential when obtaining equilibrium properties such as the free-energy landscape. This contrasts with unbiased simulations, where samples are equally important as they are sampled according to the equilibrium distribution.
A functional form of the weights depends on the particular method. Generally, for methods employing a bias potential V(z), the weight associated with a CV sample z can be written as:

w(z) ∝ e^{βV(z)}.  (8)

In the case of a static bias, the weights are given by Eq. (8). In contrast, well-tempered metadynamics uses an adaptive bias potential [Eq. (6)], and we need to account for a time-dependent constant given by [4,44]:

c(t) = (1/β) ln [ ∫ dz e^{−βF(z)} / ∫ dz e^{−β(F(z)+V(z,t))} ],  (9)

which is independent of z. We can then redefine the weights as:

w(z) = e^{β(V(z)+c)},  (10)

where V(z) + c is called the relative bias potential. Note that in the above discussion, we assume that the dependence of the bias potential on the simulation time is implicit. We can ignore the time dependence once the simulation reaches convergence; then the relative bias potential V(z) + c is quasi-stationary and does not change considerably (the bias potential V(z) and the time-dependent constant c can still increase while their sum converges). In practice, when performing reweighting, we ignore a short initial transient part of the simulation where the relative bias potential is still changing considerably.

FIG. 1. Target mapping from high-dimensional samples of configuration variables x to a low-dimensional manifold spanned by CVs z. In our framework, learning CVs is equivalent to finding the optimal parametrization of the target mapping z = ξ(x) [Eq. (2)]. The target mapping performs the reduction from R^n to R^d so that the relation p_kl between the high-dimensional samples x_k and x_l is preserved in the relation q_kl between the CV samples z_k and z_l in the low-dimensional manifold. For a detailed discussion, see Secs. II E and III B.
The standard reweighting works by employing the weights to obtain the stationary equilibrium distribution from the biased CV distribution, i.e., P(z) ∝ w(z)P_V(z). The unbiased probability distribution P(z) can be computed by histogramming or kernel density estimation, where each sample z is weighted by Eq. (8). This is done routinely in advanced simulation codes, e.g., PLUMED [45,46].
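The standard reweighting can be sketched as a weighted histogram. In this toy construction, we assume a quadratic free-energy landscape flattened exactly by a static bias, so the biased samples are uniform and the weights e^{βV(z)} recover the equilibrium distribution; the specific landscape and bias are illustrative assumptions:

```python
import numpy as np

# Toy biased trajectory: the unbiased free energy is F(z) = z^2/2 (beta = 1)
# and a static bias V(z) = -z^2/2 flattens it exactly, so the biased CV
# samples are uniform on the sampled interval (illustrative construction).
beta = 1.0
rng = np.random.default_rng(2)
z = rng.uniform(-3.0, 3.0, size=200_000)   # samples from the biased (flat) distribution
V = -0.5 * z**2                            # bias evaluated at each sample
w = np.exp(beta * V)                       # statistical weights, w(z) ∝ e^{beta V(z)}

# Weighted histogram recovers the equilibrium distribution P(z) ∝ w(z) P_V(z).
P, edges = np.histogram(z, bins=60, weights=w, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
```

The weighted histogram peaks at z = 0, reproducing the Gaussian equilibrium density even though the raw biased samples are flat.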
Manifold learning methods cannot use the standard reweighting to unbias pairwise relations between samples. Instead, a non-trivial approach to reweighting in the form of r(x_k, x_l) is required, where r(x_k, x_l) is a pairwise reweighting factor that characterizes the importance of the relation between samples x_k and x_l.

D. Biased Data for Manifold Learning
Given the requirements for the optimal CVs (Sec. II A), it is non-trivial to provide low-dimensional CVs knowing only the microscopic coordinates. Instead, we often resort to an intermediate description and select a large set of the configuration variables (often called features), for example, internal coordinates such as distances or dihedral angles. These configuration variables then define a high-dimensional space which we reduce to the optimal low-dimensional CVs. For a list of helpful configuration variables to characterize different physical systems, see, for example, the PLUMED documentation [47].
Consider data obtained from enhanced sampling simulations in which we record or select samples of the high-dimensional configuration variables x. These data define the training set from which manifold learning methods construct a low-dimensional manifold. The training data set can be generally expressed as:

X = {x_k, w(x_k)}_{k=1}^{K},  (11)

where K is the number of samples and the sample set is augmented by the corresponding statistical weights.
Note that the weights depend on x through the CV mapping [Eq. (2)].

E. Diffusion Reweighting
Geometrically, the existence of a low-dimensional representation assumes that the high-dimensional dynamical system populates a low-dimensional manifold. This assumption is known as the manifold hypothesis [33]. Under this view, the fast degrees of freedom are adiabatically slaved to the dynamics of the slow degrees of freedom, which correspond to the optimal CVs, due to the presence of fast equilibration within the metastable states. Methods leveraging this assumption belong to a class of manifold learning techniques.
The core of manifold learning methods appropriate for dimensionality reduction in dynamical systems is the construction of a random walk through a Markov chain on the data set, where the transition probabilities p_kl depend on a kernel function and distances between samples. Depending on how the transition probabilities p_kl are used to find a target mapping to a low-dimensional manifold, we can distinguish two main approaches: (i) eigendecomposition [23-31] and (ii) divergence optimization [21,22,32]. In manifold learning methods using eigendecomposition, eigenvalues and eigenvectors are used to construct the target mapping. In methods employing divergence optimization, however, the transition probabilities p_kl are used to find a Markov transition matrix q_kl constructed from low-dimensional samples (Fig. 1).
Although many kernels can be considered in manifold learning, a typical choice in spectral embedding methods is a Gaussian kernel dependent on Euclidean distances [20,27]:

G_ε(x_k, x_l) = exp(−‖x_k − x_l‖² / (2ε)),  (12)

where ε is a positive parameter chosen depending on the given data set, as it induces a length scale ∼ √ε that should match the distance between neighboring samples. Eq. (12) models the Markov transition matrix if every row is normalized to unity. However, this construction includes information only on the manifold geometry given by the pairwise distances. The remaining components required for our reweighting approach are the density and importance of the data.
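A sketch of constructing the Gaussian kernel matrix over a data set; the median-distance rule for choosing ε is a common heuristic in the spirit of the ε estimate used later in Sec. III A, though the exact rule here is an assumption:

```python
import numpy as np

# Illustrative data: 200 samples of 5 configuration variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))

# Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b.
sq = (X**2).sum(axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
np.maximum(D2, 0.0, out=D2)        # clip tiny negatives from round-off

# Median heuristic: length scale sqrt(eps) ~ typical pairwise distance.
eps = np.median(np.sqrt(D2[np.triu_indices_from(D2, k=1)])) ** 2
G = np.exp(-D2 / (2.0 * eps))      # Gaussian kernel matrix G_eps
```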
For the Markov transition matrix, the reweighting procedure must be reformulated to include the weights w(x_k) and w(x_l) for a pair of samples x_k and x_l, respectively. Our plan is to derive a pairwise reweighting formula where each pairwise transition probability given by the Markov transition matrix M(x_k, x_l) depends also on a reweighting factor r(x_k, x_l). We assume that a reweighted Markov transition matrix can be defined in a simple form:

M(x_k, x_l) = r(x_k, x_l) G_ε(x_k, x_l) / Σ_m r(x_k, x_m) G_ε(x_k, x_m),  (13)

where M is row-stochastic. The Markov transition matrix then models the unbiased Markov chain, where each entry is the probability of the jump from x_k to x_l.
To account for the manifold density, we need to employ a density-preserving kernel. In contrast to Laplacian eigenmaps, which are appropriate for data sampled uniformly [20,26], diffusion maps allow working with data sampled from any underlying probability distribution. Specifically, let us consider the pairwise transition probabilities based on an anisotropic diffusion kernel given by [27]:

K_α(x_k, x_l) = G_ε(x_k, x_l) / [ρ(x_k) ρ(x_l)]^α,  (14)

where ρ(x) is a kernel density estimator and α ∈ [0, 1] is the anisotropic diffusion parameter, which is crucial to properly include information about the data density and importance [28]. Based on the anisotropic diffusion parameter, diffusion maps can be used to parametrize a family of low-dimensional embeddings. In Eq. (14), the density estimator ρ(x_k) at a sample x_k must be reweighted to account for the data importance:

ρ(x_k) = Σ_l w(x_l) G_ε(x_k, x_l),  (15)

which is a weighted kernel density estimate up to an unimportant multiplicative constant. After the reweighting, the density estimator characterizes the unbiased density, in contrast to the biased density estimate that is given as:

ρ_V(x_k) = Σ_l G_ε(x_k, x_l),  (16)

where the subscript V denotes that the density estimate is calculated under the bias potential V. In theory, if the underlying probability distribution of high-dimensional samples is known analytically, it is possible to express ρ directly from this distribution [30]; e.g., from the Boltzmann distribution [Eq. (1)] if the samples are represented by the microscopic coordinates. However, this is valid only in the case of sufficient sampling and is thus rarely reachable in practice. Moreover, the high-dimensional distribution P(x) of the configuration variables is unknown in general (Sec. II A). For this reason, we write ρ as a kernel density estimate [Eq. (15)].
We can understand the physical meaning behind the anisotropic diffusion kernel by considering Eq. (14). The dynamics described by Eq. (14) is local, as samples closer to each other have a higher probability of being close in the respective low-dimensional manifold, and vice versa when they are farther apart. This information about the underlying geometry is given by G_ε(x_k, x_l), which penalizes the transition probabilities between geometrically distant samples x_k and x_l. The density and importance of samples are encoded in the unbiased density estimates [Eq. (15)].
Depending on the value of α in Eq. (14), three interesting cases of diffusion maps can be considered asymptotically [28]. Namely, (i) for α = 1/2, Eq. (14) corresponds to the Markov chain that is an approximation of the diffusion given by the Fokker-Planck generator with the underlying data density proportional to the equilibrium density, allowing us to approximate the long-time behavior of the microscopic coordinates. Other values of α are also possible, e.g., (ii) for α = 0, we get the classical normalized graph Laplacian, and (iii) for α = 1, we ignore the underlying density and the diffusion operator approximates the Laplace-Beltrami operator. We note that this asymptotic behavior holds in the limit of infinite data (K → ∞ and ε → 0) when considering the microscopic coordinates. As we are interested in finding low-dimensional CVs, the case α = 1/2 is appropriate to model asymptotically the slowest degrees of freedom, accounting for both the underlying geometry and density of the manifold.
As we have all the required ingredients for the reweighting of Markov transition matrices, we focus on deriving the reweighting factor.Here, we discuss only an outline, while a detailed derivation is provided in Appendix A.
Based on Eq. (14), the Markov transition matrix can be estimated by weighting each kernel term and normalizing it so that it is row-stochastic:

M(x_k, x_l) = w(x_l) K_α(x_k, x_l) / Σ_m w(x_m) K_α(x_k, x_m).  (17)

Next, by inserting Eq. (14) into Eq. (17), we can see that the Markov transition matrix M can be written also using the Gaussian kernels:

M(x_k, x_l) = r(x_k, x_l) G_ε(x_k, x_l) / Σ_m r(x_k, x_m) G_ε(x_k, x_m),  (18)

where we can recognize the reweighting factor by comparing the result to Eq. (13). Therefore, we get the following expression:

r(x_k, x_l) = w(x_l) / [ρ(x_k) ρ(x_l)]^α.  (19)

We can also approximate the reweighting factor by rewriting Eq. (19) with the biased density estimate [Eq. (16)]:

r(x_k, x_l) = √( w(x_k) w(x_l) / [ρ_V(x_k) ρ_V(x_l)] ),  (20)

where we set α = 1/2. Eq. (20) is the final form of the reweighting factor that we use here. A detailed derivation of Eq. (20) is provided in Appendix A. Although the derivation of Eq. (20) is presented using the Gaussian kernel, our framework can be used in other manifold learning methods, as demonstrated in Sec. III.
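Putting the pieces together, a sketch of diffusion reweighting with α = 1/2: given biased samples and their statistical weights, we build the biased density estimate [Eq. (16)], the reweighting factor [Eq. (20)], and the row-stochastic Markov matrix [Eq. (18)]. Random data stand in for a biased trajectory, and the bandwidth choice is an assumption:

```python
import numpy as np

# Illustrative stand-ins for biased samples and their statistical weights.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))            # high-dimensional biased samples
w = rng.uniform(0.5, 2.0, size=300)      # weights, e.g. w_k = exp(beta (V + c))

# Gaussian kernel on squared pairwise distances.
sq = (X**2).sum(axis=1)
D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
eps = np.median(D2)
G = np.exp(-D2 / (2.0 * eps))

rho_V = G.sum(axis=1)                    # biased kernel density estimate, Eq. (16)
r = np.sqrt(np.outer(w, w) / np.outer(rho_V, rho_V))   # Eq. (20), alpha = 1/2
M = r * G
M /= M.sum(axis=1, keepdims=True)        # row-stochastic normalization, Eq. (18)
```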
Eq. (18) denotes an unbiased Markov chain with the transition probability from x_k to x_l in one time step given by:

p(x_k → x_l) = M(x_k, x_l).  (21)

We term our reweighting procedure diffusion reweighting. We postulate that the derived Markov transition matrix [Eq. (18)] has the following three properties that make the construction of Eq. (21) from enhanced sampling simulations feasible. Namely, the Markov transition matrix encodes the information about:

1. Geometry G_ε(x_k, x_l): The probability of transitions between samples lying far from each other is low and high for those in close proximity.

2. Density ρ_V(x_l): The transition probabilities account for the underlying density of the samples, so that densely and sparsely sampled regions of the manifold are treated correctly.

3. Importance w(x_l): The statistical weights from enhanced sampling decide, according to the bias, whether a sample is important, i.e., metastable states, where the weights are higher, are more important than high free-energy regions.

F. Implementation
Our framework is implemented in a development version of PLUMED 2.7 [45,46] as the LowLearner module and will be made publicly available in the near future.

III. REWEIGHTED MANIFOLD LEARNING
We incorporate diffusion reweighting into several manifold learning methods and apply them to find a low-dimensional representation in a model system and in high-dimensional atomistic simulation problems represented by biased simulation data. Specifically, we consider diffusion reweighting in diffusion maps [28-30] and in the recently introduced stochastic embedding methods for learning CVs and adaptive biasing [14,15].
To demonstrate the validity of our framework, we apply diffusion maps to standard test systems: a particle moving on an analytical potential and alanine dipeptide. For the stochastic embedding methods, we choose the mini-protein chignolin. For the two atomistic systems, alanine dipeptide and chignolin, we describe the systems using two different types of high-dimensional representations (distances and dihedral angles, respectively) to show that the framework can work regardless of the chosen configuration variables.

A. Diffusion Maps
We start by considering the case of diffusion maps, on which we base the derivation of the reweighting factor r(x_k, x_l) (Sec. II E). By rewriting the diffusion kernel using the biased density estimates [Eq. (20)], we can use it to construct a low-dimensional embedding from a biased data set. We directly use Eq. (18) to estimate the transition probabilities while using Eq. (20) to account for the sampling from any biased distribution.

Target Mapping ξ(x): Eigendecomposition
With the exception of the reweighting factor, further steps in our approach to diffusion maps proceed as in its standard formulation [28]. Let us briefly recap these steps.
In diffusion maps, the spectral decomposition of the Markov transition matrix M is performed to define a low-dimensional embedding, Mψ = λψ, where {λ_l} and {ψ_l} are eigenvalues and eigenvectors, respectively. The eigenvalues are related to the effective timescales as τ_l = −1/ln λ_l and can be used to determine the slowest processes in the dynamics. Then, the eigenvectors corresponding to the largest eigenvalues define a reduced space. Given this interpretation, the target mapping [Eq. (2)] is defined by the diffusion coordinates:

ξ(x) = (λ_0 ψ_0(x), λ_1 ψ_1(x), ..., λ_{d−1} ψ_{d−1}(x)),  (22)

where ξ(x) is computed using the first d eigenvalues and eigenvectors, with the equilibrium density represented by the zeroth coordinate λ_0 ψ_0. In Eq. (22), the spectrum of the eigenvalues {λ_l} is sorted by non-increasing value, λ_0 ≥ λ_1 ≥ ... The truncation up to d − 1 of Eq. (22) for metastable systems corresponds to a negligible error on the order of O(λ_d / λ_{d−1}) [28]. In other words, this assumption relates to a large spectral gap that separates slow degrees of freedom (> λ_{d−1}) and fast degrees of freedom (< λ_{d−1}). For a detailed description behind the construction of the diffusion coordinates, we refer to works by Coifman et al. [27,29,30].
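The eigendecomposition step can be sketched on a small toy Markov matrix; the matrix here is hand-made and row-stochastic, whereas in practice M comes from the reweighted kernel construction:

```python
import numpy as np

# A toy row-stochastic Markov matrix standing in for M.
M = np.array([[0.90, 0.08, 0.02],
              [0.08, 0.90, 0.02],
              [0.10, 0.10, 0.80]])

evals, evecs = np.linalg.eig(M)            # M is not symmetric in general
order = np.argsort(-evals.real)            # sort by non-increasing eigenvalue
evals, evecs = evals.real[order], evecs.real[:, order]

d = 2
xi = evals[:d] * evecs[:, :d]              # diffusion coordinates, Eq. (22)
timescales = -1.0 / np.log(evals[1:])      # tau_l = -1/ln(lambda_l), skipping lambda_0 = 1
```

The leading eigenvalue of a row-stochastic matrix is 1, so the effective timescales are computed from the subsequent eigenvalues.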

Algorithm
The described algorithm for our reweighted diffusion maps is given in Algorithm 1.

Algorithm 1: Reweighted Diffusion Maps
Output: Eigenvalues {λ_k} and eigenvectors {ψ_k} of the transition matrix M used to construct the target mapping ξ.
1. Calculate the squared pairwise distances ‖x_k − x_l‖².
2. Construct the reweighted Markov transition matrix:
(a) Compute the Gaussian kernels G_ε(x_k, x_l) [Eq. (12)].
(b) Weight each kernel by the reweighting factor r(x_k, x_l) [Eq. (20)].
(c) Normalize to obtain the row-stochastic matrix M.
3. Perform the spectral decomposition of M and construct the diffusion coordinates [Eq. (22)].

Example: Model Potential
As a simple and illustrative example of applying diffusion reweighting within the diffusion-map framework, we consider a case where dimensionality reduction is not performed. Namely, we run an enhanced sampling simulation of a single particle moving along the x variable on a one-dimensional potential U(x) with three Gaussian-like metastable states with different energy depths and energy barriers between the minima [Fig. 2(a)]. In this system, the highest energy barrier is ∼50 k_B T, which makes transitions from the deepest minimum rare. The dynamics is modeled by a Langevin integrator [48] using temperature T = 1, a friction coefficient of 10, and a time step of 0.005. We employ the pesmd code in the PLUMED [45,46] plugin. We bias the x variable using well-tempered metadynamics [42] with a bias factor of γ = 10. Further details about the simulation are given in the Supplemental Material [49].
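For orientation, an overdamped Langevin integrator on a toy multi-well potential can be sketched as follows; the potential (and hence the barrier heights) is an illustrative stand-in for the model potential in Fig. 2, while the temperature, friction, and time step match the values quoted above:

```python
import numpy as np

# Toy potential with three minima (x = -1, 0, +1) and illustrative barriers.
def U(x):
    return 20.0 * x**2 * (x**2 - 1.0) ** 2

def dU(x):  # analytical derivative of U
    return 40.0 * x * (x**2 - 1.0) * (3.0 * x**2 - 1.0)

kBT = 1.0        # temperature T = 1 in reduced units
friction = 10.0  # friction coefficient
dt = 0.005       # time step

rng = np.random.default_rng(5)
x = -1.0                          # start in one of the minima
traj = np.empty(20_000)
for i in range(traj.size):
    drift = -dU(x) / friction * dt
    noise = np.sqrt(2.0 * kBT * dt / friction) * rng.standard_normal()
    x += drift + noise
    traj[i] = x
```

Without a bias potential, such a trajectory remains mostly confined near its starting minimum, which is the kinetic entrapment that metadynamics alleviates.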
We present our results in Fig. 2(b). We can see that the non-reweighted diffusion map (without applying diffusion reweighting) learns the biased distribution (given by λ_0 ψ_0) along the coordinate x, where the three energy minima correspond to the maxima of the biased distribution. Additionally, the first two diffusion coordinates are not orthogonal, and there is a lack of separation between the metastable states.
In contrast, the reweighted diffusion map can represent the equilibrium density (λ_0 ψ_0), where only the first energy minimum is populated due to the high free-energy barriers separating the states. The λ_0 ψ_0 and λ_1 ψ_1 diffusion coordinates properly separate the samples. We can see that λ_1 ψ_1 is almost marginal, as the one-dimensional potential provides no additional dimensions to resolve.
The example presented in Fig. 2 is, of course, a trivial case in which no dimensionality reduction is performed; however, it indicates that diffusion reweighting can be used to reweight the transition probabilities successfully and that the standard diffusion map trained on the biased data captures an incorrect representation.

Example: Alanine Dipeptide
As a next example, we consider alanine dipeptide (Ace-Ala-Nme) in the gas phase described using the Amber99-SB force field [50]. The data set is generated by a 100-ns molecular dynamics simulation [51,52] using the GROMACS 2019.2 code [53] patched with a development version of the PLUMED [45,46] plugin. The simulation is performed with well-tempered metadynamics [42] at 300 K using the backbone dihedral angles Φ and Ψ for biasing and a bias factor of 5. Using this setup, the convergence of the bias potential is obtained fairly quickly. Further details about the simulation are given in the Supplemental Material [49].
Using diffusion maps, we reduce the high-dimensional space consisting of all pairwise distances between the heavy atoms (n = 45) to two dimensions. The diffusion maps are constructed using ε = 0.078, estimated as the median of the pairwise distances.
We present diffusion reweighting results for alanine dipeptide in Fig. 3. The eigenvalues of the Markov transition matrix have a spectral gap (i.e., a timescale separation) with only a few eigenvalues close to one and all other eigenvalues much smaller than one. Thus, only the first few eigenvectors are needed to approximate the diffusion coordinates [Eq. (22)] and thus the target mapping to the CV space. The eigenvalues {λ_l} indicate that the spectral gap is slightly wider for the reweighted transition probability matrix, as can be seen in Fig. 3(b). Consequently, the effective timescales τ_l = −1/ln λ_l calculated from the eigenvalues indicate that the reweighted diffusion map corresponds to slower processes; see the Supplemental Material [49].
We can see that the non-reweighted approach cannot correctly account for the transition probabilities calculated based on the biased simulation, as we expected. The transitions between the metastable states are so frequent that the first diffusion coordinate (the equilibrium density) suggests only one metastable state [Fig. 3(c)]. In the Supplemental Material [49], we show that the separation of samples in the reweighted diffusion map is much better than for the non-reweighted diffusion map. It resembles a "typical" diffusion map from unbiased data sets.
In the reweighted case, the low-dimensional coordinates can distinguish between the relevant metastable states. Additionally, using Eq. (20), the first diffusion-map coordinate, λ_0 ψ_0(x), correctly encodes the information about the Boltzmann equilibrium distribution of alanine dipeptide in the dihedral angle space, which is not possible using the standard (i.e., non-reweighted) diffusion map in the case of biased simulation data [Fig. 3(c)].
By comparing the reweighted diffusion map to a diffusion map constructed from an unbiased parallel tempering replica at 300 K, we can see that the embeddings and eigenvalues are virtually identical; see Supplemental Material [49].
These results further corroborate our findings and show that when performing dimensionality reduction on data from enhanced sampling simulations, the reweighting factor [Eq. (20)] is needed to revert the effect of biasing on the transition probability matrix.

B. Stochastic Embeddings
Next, we employ diffusion reweighting in more recent approaches. We consider manifold learning methods devised primarily to learn CVs from biased simulation trajectories: multiscale reweighted stochastic embedding (mrse) [15] and stochastic kinetic embedding (stke) [14]. These methods use approximations of the reweighting factor [Eq. (20)]. Our aim is not to compare results obtained with these methods but to present and discuss how diffusion reweighting can be approximated and employed in manifold learning methods other than diffusion maps.
First, let us focus on the general procedure these stochastic embedding methods use to parametrize manifolds. In particular, we discuss how these methods use Markov transition matrices to parametrize the target mapping to low-dimensional manifolds. The construction of the Markov transition matrix with reweighting from biased data in each technique is discussed in the remainder of this section.

Target Mapping ξ_θ(x): Divergence Optimization
As mentioned above, the stochastic embedding methods belong to the second category of manifold learning methods considered here, i.e., those based on divergence optimization. Thus, unlike in diffusion maps, no eigendecomposition is performed in these methods. Instead, the target mapping ξ is parametrized by neural networks that perform nonlinear dimensionality reduction. The target mapping is given as:

ξ_θ : x ↦ z = ξ_θ(x), (23)

where θ = {θ_k} are parameters of the target mapping, adjusted such that the low-dimensional manifold of CVs is optimal with respect to a selected statistical measure. Using Eq. (23), the distance between samples in the manifold can be given as ∥z_k − z_l∥ = ∥ξ_θ(x_k) − ξ_θ(x_l)∥. Note that in some simple cases, the mapping in Eq. (23) can also be represented as a linear combination. However, deep learning has been successful in a broad range of learning problems, and using more intricate approximations for the mapping between high-dimensional and low-dimensional spaces is common for complex data sets [54,55].
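A target mapping of this kind can be sketched as a small feed-forward network. The architecture below (one tanh hidden layer, hypothetical layer sizes) is purely illustrative, implements only the forward pass, and is not the network used by mrse or stke:

```python
import numpy as np

class TargetMapping:
    """Illustrative xi_theta: R^n -> R^d as a two-layer network (forward pass only)."""
    def __init__(self, n_in, n_hidden=32, n_out=2, seed=0):
        rng = np.random.default_rng(seed)
        # Parameters theta = {W1, b1, W2, b2}, here randomly initialized.
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def __call__(self, X):
        h = np.tanh(X @ self.W1 + self.b1)  # nonlinear hidden layer
        return h @ self.W2 + self.b2        # low-dimensional CVs z = xi_theta(x)
```

In practice, the parameters θ are adjusted by the divergence optimization described next rather than left at their random initial values.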
The target mapping is parametrized by comparing the Markov transition matrix M(x_k, x_l) = (p_kl) (Sec. III B 2) constructed from the high-dimensional samples to a Markov transition matrix Q(z_k, z_l) = (q_kl) built from the low-dimensional samples mapped using the target mapping [Eq. (23)]. In stke, we use a Gaussian kernel for Q, while in mrse, we employ a one-dimensional t-distribution, as implemented in t-sne [22,55]. Taking the target mapping as defined in Eq. (23), the transition probabilities in the low-dimensional space q_kl in mrse are:

q_kl ∝ (1 + ∥ξ_θ(x_k) − ξ_θ(x_l)∥²)⁻¹,

where the normalization constant is omitted. The choice of the t-distribution for Q in mrse is motivated by the crowding problem [22]: as the volume of a low-dimensional neighborhood grows more slowly than the volume of a high-dimensional one, the neighborhood is stretched so that moderately distant sample pairs are placed too far apart. As outlined in Ref. [22], the use of a heavy-tailed distribution for the low-dimensional representation allows moderate distances in the high-dimensional space to be represented by much larger distances in the manifold, encouraging gaps to form in the low-dimensional map between the clusters present in the data, which alleviates the crowding problem to some degree.
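For concreteness, a t-distribution transition matrix Q can be sketched as follows. The row-wise normalization is our assumption; implementations may normalize over all pairs instead, as in symmetric t-sne:

```python
import numpy as np

def t_transition_matrix(Z):
    """Row-stochastic Q built from low-dimensional samples with a t-kernel."""
    # Pairwise squared distances in the low-dimensional manifold.
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = 1.0 / (1.0 + d2)          # heavy-tailed (Cauchy/t) kernel, as in t-sne
    np.fill_diagonal(K, 0.0)      # exclude self-transitions
    return K / K.sum(axis=1, keepdims=True)
```

The heavy tail assigns non-negligible probability to moderately distant pairs, which is what counteracts the crowding problem described above.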
Finally, the Markov transition matrices computed from the high-dimensional and low-dimensional samples need to be compared. The most common choice for such a comparison is a statistical distance, particularly the Kullback-Leibler divergence:

D_KL(M ∥ Q) = Σ_k Σ_l p_kl log(p_kl / q_kl), (27)

where, in contrast to the standard formulation of the Kullback-Leibler divergence, which compares two probability distributions, Eq. (27) is computed for every pair of rows from M and Q and then summed. Many choices are possible for the comparison between M and Q, e.g., the cross-entropy [17,32] or the Jensen-Shannon divergence [14]. The Kullback-Leibler divergence is minimized to train the target mapping represented by a neural network. As the target mapping is parametric, the gradients of D_KL with respect to the parameters θ = {θ_k} of the neural network can be computed efficiently using backpropagation. For further details about training the neural networks, we refer to Appendix E.
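The row-wise divergence in Eq. (27) reduces to a few lines of numpy; the clipping constant below is a numerical safeguard of our choosing:

```python
import numpy as np

def kl_divergence_rows(M, Q, tiny=1e-12):
    """Sum of row-wise Kullback-Leibler divergences D_KL(M || Q)."""
    # Clip to avoid log(0); both matrices are assumed row stochastic.
    M = np.clip(M, tiny, None)
    Q = np.clip(Q, tiny, None)
    return float(np.sum(M * np.log(M / Q)))
```

This quantity vanishes when Q reproduces M exactly and is positive otherwise, which is what makes it a suitable training loss.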

Reweighted Markov Transitions
After explaining how the parametric mapping is constructed in the reweighted stochastic embeddings, we proceed to formulate the Markov transition matrices and the reweighting factors for these methods.
First, let us consider the reweighting performed in mrse [15]. This method employs the following reweighting factor:

r(x_k, x_l) = √(w(x_k) w(x_l)), (28)

where we neglect the biased density estimates ρ_V [cf. Eq. (28) and Eq. (20)]. The reweighting factor [Eq. (28)], written as a geometric mean of two statistical weights, can be justified by the fact that the bias potential is additive, as shown in Eq. (8), and a geometric mean preserves this relation. We note that similar reweighting procedures have been used in Refs. [36,37,56]. The Markov transition matrix in mrse is expressed as a Gaussian mixture, where each Gaussian is evaluated for a different value of ε and reweighted using Eq. (28):

M(x_k, x_l) ∝ Σ_ε √(w(x_k) w(x_l)) exp(−∥x_k − x_l∥²/2ε), (29)

where we omit the normalization constant for brevity.
The sum in Eq. (29) is over bandwidths that are automatically estimated and selected to fit the data. Note that many methods can be used for this purpose; to facilitate the analysis, however, we use the method from Ref. [15]. As this procedure is mostly technical, we refer to Appendix C for details about estimating the bandwidths and constructing the Gaussian mixture.

Second, let us consider stke. Suppose the high-dimensional samples are resampled so that each sample keeps a certain distance from the others. In that case, the distribution of samples can be viewed as approximately uniform, and w(x) can be replaced by the unbiased probability density estimator ρ(x) in Eq. (28). Thus, the reweighting factor is given by:

r(x_k, x_l) = √(ρ(x_k) ρ(x_l)), (30)

which is the formula used in stke [14,57]. The corresponding Markov transition matrix is:

M(x_k, x_l) ∝ √(ρ(x_k) ρ(x_l)) exp(−∥x_k − x_l∥²/2ε), (31)

where, as in Eq. (29), the k-th reweighting term cancels out during the normalization. An interesting property of the transition probabilities used by this method is that, by taking an approximation to the normalization constant (Appendix B), we arrive at a transition probability matrix of a form similar to the square-root approximation of the infinitesimal generator of the Fokker-Planck operator [58-61]:

p_kl ∝ √(ρ(x_l)/ρ(x_k)) exp(−∥x_k − x_l∥²/2ε), (32)

for a single ε. The square-root approximation was initially derived by discretizing a one-dimensional Smoluchowski equation [62]. It can also be shown that Eq. (32) can be obtained using the maximum path entropy approach [63,64].

As many algorithmic choices are available for each procedure incorporated in the reweighted stochastic embedding framework, it is difficult to compare mrse and stke directly. However, our aim is to discuss how approximations of the reweighting factor are employed in these methods and how they can be used to learn CVs from biased data. Thus, in the above discussion, we focus on the reweighting procedures for the Markov transition matrices used by these methods. For a comparison of the parameters used by these methods, see Appendix E.
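As a numerical sketch of the geometric-mean reweighting in Eq. (28), assuming metadynamics-style weights of the form w(x) ∝ exp[βV(x)] (the functional form of the weights is an assumption made here for illustration; cf. Eq. (10)):

```python
import numpy as np

def geometric_mean_reweighting(V, beta):
    """Pairwise factor sqrt(w_k w_l) from bias values V (assumed w = e^{beta V})."""
    # Shift by the maximum for numerical stability; constants cancel on
    # normalization of the transition matrix.
    w = np.exp(beta * (V - V.max()))
    # Symmetric pairwise reweighting matrix, Eq. (28).
    return np.sqrt(np.outer(w, w))
```

The resulting matrix is symmetric, consistent with the additivity argument for the geometric mean given above.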

Algorithm
For the general algorithm used by the stochastic embedding techniques to find a low-dimensional manifold of the data, see Algorithm 2. Briefly, the Markov transition matrix M(x_k, x_l) is estimated according to the method used and reweighted using the approximation of the reweighting factor (Sec. III B 2); the divergence optimization is then repeated until convergence is reached; finally, the high-dimensional samples are mapped to CVs using ξ(x) with the optimal parameters given by arg min_θ D_KL(θ).

Example: Chignolin
As an example for the two stochastic embedding methods mrse and stke, we consider the folding and unfolding of the ten-amino-acid miniprotein chignolin (CLN025) [65] in explicit solvent. We employ the CHARMM27 force field [66] and the TIP3P water model [67], and perform the molecular dynamics simulation [51,52] using the gromacs 2019.2 code [53] patched with a development version of the plumed [45,46] plugin. Our simulations are performed at 340 K for easy comparison with other simulation data obtained at the same temperature [68,69]. We perform a 1-µs well-tempered metadynamics simulation with a large bias factor of 20. We select a high bias factor to illustrate that our framework is able to learn the metastable states in a low-dimensional manifold even when the free-energy barriers are virtually flattened and the system dynamics is close to diffusive at convergence.
As biased CVs to enhance transitions between the folded and unfolded conformations of CLN025 in the metadynamics simulation, we choose the distance between the Cα atoms of residues Y1 and Y10 (d) and the radius of gyration (r_g) [Fig. 4(c)]. We consider CLN025 conformations folded when the distance is below ∼0.8 nm and unfolded otherwise; see the corresponding time series in Supplemental Material [49]. From the resulting trajectory, we calculate the sines and cosines of all the backbone Φ and Ψ dihedral angles and use them as the high-dimensional representation of CLN025, which amounts to 32 variables in total. We collect high-dimensional samples every 1 ps for the biased training data set. Then, the low-dimensional manifolds are trained on representative samples selected as described in Refs. [14,15]. As we focus mainly on the Markov transition matrices and diffusion reweighting here, we provide a detailed discussion of the subsampling procedures in Appendix D.
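The high-dimensional representation used here is a simple periodic featurization of the dihedral angles; a minimal sketch (array and function names are ours) is:

```python
import numpy as np

def dihedral_features(angles):
    """Map (N, K) backbone dihedral angles in radians to (N, 2K) sin/cos features."""
    # Stacking sin and cos removes the 2*pi periodicity of raw angles.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
```

For the 16 backbone Φ and Ψ angles of CLN025, this yields the 32 variables mentioned above.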
In Fig. 4, we present the resulting manifolds spanned by the trained CVs computed using the reweighted stochastic embedding methods (Sec. III B). The embedding presented in Fig. 4(a) is calculated using mrse [15], while the embedding presented in Fig. 4(b) is calculated using stke [14], using their corresponding reweighting formulas given by Eq. (28) and Eq. (30), respectively. For each manifold, the corresponding free-energy landscape is calculated using kernel density estimation, with the statistical weights used to reweight each sample [Eq. (10)].
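A weighted free-energy estimate over a two-dimensional embedding can be sketched as follows. We use a weighted histogram instead of the kernel density estimate of the text, and k_BT ≈ 2.83 kJ/mol at 340 K; both simplifications are ours:

```python
import numpy as np

def free_energy_2d(z, weights, kBT=2.83, bins=50):
    """F(z) = -kBT log rho(z) from weighted samples of a 2D embedding."""
    # Weighted histogram as a crude density estimate on the manifold.
    H, xedges, yedges = np.histogram2d(z[:, 0], z[:, 1], bins=bins,
                                       weights=weights, density=True)
    with np.errstate(divide="ignore"):
        F = -kBT * np.log(H)
    # Shift so that the global minimum of the landscape is at zero.
    return F - np.min(F[np.isfinite(F)]), (xedges, yedges)
```

Empty bins correspond to unsampled (infinite free-energy) regions and remain infinite after the shift.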
We can observe that the free-energy landscape in the low-dimensional manifold calculated by mrse is highly heterogeneous, with multiple partially unfolded intermediate states and many possible reaction pathways, as shown in Fig. 4(a). Such a complex free-energy landscape shows that the dynamics of CLN025 is more intricate than is visible in the free-energy surface spanned by the distance and the radius of gyration [Fig. 4(c)], where we can see only the folded, intermediate, and unfolded states, and the remaining states are possibly degenerate.
In Fig. 4, we can see that the lower-lying free-energy basins in the reweighted stochastic embeddings are captured by both mrse and stke. We can also notice a slight difference between the metastable states lying higher in free energy. Specifically, mrse captures more states below a threshold of 25 kJ/mol than the embedding rendered by stke, in which the rest of the states are placed above 25 kJ/mol (i.e., mainly different unfolded states).
In our simulations, we do not observe a misfolded state of CLN025 shown to be highly populated in several studies [70,71] employing different force fields (Amber99 [72] and Amber99-SB [50], respectively) than the CHARMM27 force field used here [66]. This misfolded state is also not observed in the long unbiased simulation from Ref. [68] that employs the same CHARMM27 force field as we do.
Comparing the free-energy barriers between the different embeddings in Fig. 4, we can see that they are similar, particularly for the mrse embedding and the free-energy surface spanned by the distance and the radius of gyration, i.e., from 10 to 15 kJ/mol. We can compare our results to the unbiased simulation data from the study of Lindorff-Larsen et al. [68], where the authors perform a very long simulation and observe a significant number of folding and unfolding events, thus allowing unbiased estimates of free-energy barriers to be obtained. In their study, CLN025 was shown to be a "fast folder" with a corresponding free-energy barrier of ∼10 kJ/mol. Similar estimates have also been obtained in Ref. [69]. Therefore, we can conclude that the free-energy barriers in the embeddings agree well with previous computational studies.
Note that the simulation of CLN025 performed in Ref. [68] is ∼100 µs long compared to our 1-µs simulation.This clearly illustrates the great benefit of combining manifold learning with the ability to learn from biased data sets.
Overall, both the separation of the CLN025 metastable states and the free-energy landscapes calculated for the low-dimensional embeddings suggest that the proposed framework can be used to find slow CVs and physically valid free-energy estimates. The presented results [Fig. 4] clearly show that, using our approach, we can construct a meaningful and informative low-dimensional representation of a dynamical system from a biased data set, even when employing strong biasing (i.e., the high bias-factor regime in the case of well-tempered metadynamics).
We underline that diffusion reweighting makes learning CVs from high-dimensional samples possible regardless of which conformational variables are biased to generate the data set. This extends the applicability of manifold learning methods to atomistic trajectories of any type (unbiased and biased) and makes it possible to learn CVs from a biased data set, where the sampling is faster and more extensive than in an unbiased data set.

IV. CONCLUSIONS
Nonlinear dimensionality reduction has been successfully applied to high-dimensional data without dynamical information. Dynamical data, however, pose a distinct problem with characteristics different from those of generic data. Standard dimensionality reduction applied to dynamical data may result in a representation that does not contain dynamical information. This problem is even more pronounced in enhanced sampling, where we sample a biased probability distribution and additional assumptions about the data structure have to be made. As such, manifold learning methods require a framework with several modifications to allow working on trajectories obtained from enhanced sampling simulations. In this work, we introduce such a framework.
The main result of our work is the introduction of a reweighting procedure for manifold learning methods that use transition probabilities to build low-dimensional embeddings. This advancement enables us to directly construct a low-dimensional representation of CVs from enhanced sampling simulations. We show how our approach can be leveraged to reconstruct slow CVs from enhanced sampling simulations even in high bias-factor regimes. Our framework can be further exploited in constructing low-dimensional representations of dynamical systems using other manifold learning methods. For instance, it could be used in spectral embedding maps [20,26] or stochastic neighbor embedding (e.g., t-sne) [21,22,55]. There are numerous stages at which such methods have scope for different algorithmic choices; consequently, many possible algorithms can work within our framework. An interesting direction for further research is to combine diffusion reweighting with a metric different from the Euclidean distance, for instance, a metric that enables introducing a lag time, as done in the case of kinetic and commute maps [73-75], a Mahalanobis kernel [76,77], or delay coordinates [78]. Diffusion reweighting can also be extended to yield intrinsic timescales directly from enhanced sampling simulations, based on their relation to the eigenvalues. We plan to take this road in the near future.
We underline that the presented diffusion reweighting can be used with any enhanced sampling method, as it can work with any functional form of the weights. For instance, tempering methods such as parallel tempering [79] can be used, where the weights are given as e^{−ΔβU} for the difference in inverse temperatures Δβ between the simulation temperature and the target temperature.
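For instance, the parallel tempering weights can be sketched as follows (potential energies U in kJ/mol; the normalization is our convention):

```python
import numpy as np

def tempering_weights(U, T_sim, T_target, kB=0.008314):  # kB in kJ/(mol K)
    """Weights e^{-dbeta * U} reassigning samples from T_sim to T_target."""
    dbeta = 1.0 / (kB * T_target) - 1.0 / (kB * T_sim)
    # Shift by the minimum energy for numerical stability; the constant
    # cancels upon normalization.
    w = np.exp(-dbeta * (U - U.min()))
    return w / w.sum()  # normalized statistical weights
```

When the target temperature is below the simulation temperature, low-energy configurations receive larger weights, as expected.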
Our framework makes it possible to generate biased data sets that, given the construction of enhanced sampling methods, sample a larger conformational space than standard atomistic simulations, and to use such data to learn low-dimensional embeddings. If a data set contains many infrequent events, the low-dimensional representation is more likely to encode them quantitatively. Moreover, in the case of the reweighted stochastic embedding methods covered here, the generated embeddings can be used for biasing in an iterative manner, e.g., by alternating between learning and biasing phases. We believe that the accurate construction of the Markov transition probability matrix is a crucial element in implementing such an algorithm optimally, without being restricted by kinetic bottlenecks (i.e., low-probability transition regions).

FIG. 2. Diffusion maps generated for the reweighted and non-reweighted (without applying diffusion reweighting) biased simulation of a particle in a simple (a) one-dimensional potential U(x), where the energy barriers separating the deepest minimum are on the order of 50 kBT and the corresponding transitions from this state are rare events. (b) A comparison between the non-reweighted (blue) and reweighted (red) diffusion maps: the equilibrium densities along the coordinate x and the diffusion coordinates λ0ψ0 vs. λ1ψ1, colored according to the x value. The enhanced sampling simulation is performed using well-tempered metadynamics [42] with a bias factor of 10, employing the pesmd code in the plumed [45,46] plugin.

FIG. 4. Reweighted stochastic embeddings calculated for chignolin in the TIP3P solvent at 340 K simulated using the CHARMM27 force field. The low-dimensional manifolds are colored according to their free energy. (a) Representative conformations from the metastable states estimated by the reweighted embedding methods are shown around the mrse embedding. (b) The embedding obtained using stke. Well-tempered metadynamics is used to generate the training set consisting of the sines and cosines of all Φ and Ψ dihedral angles, amounting to 32 variables in total. The training set is generated by performing a 1-µs simulation with a bias factor γ = 20, enhancing the fluctuations of the distance d between the Cα atoms of residues Y1 and Y10 and the radius of gyration rg. (c) The free-energy surface calculated along d and rg. The axes and units for the embeddings are arbitrary and thus not shown. See Supplemental Material [49] for computational details.
. acknowledges funding from the Polish Science Foundation (START), the National Science Center in Poland (Sonata 2021/43/D/ST4/00920, "Statistical Learning of Slow Collective Variables from Atomistic Simulations"), and the Ministry of Science and Higher Education in Poland. M. C. and T. K. G. acknowledge the support of Purdue Startup Funding. Calculations were performed on the Opton cluster at the Institute of Physics, NCU.