Protein Folding Free Energy Landscape along the Committor - the Optimal Folding Coordinate

: Recent advances in simulation and experiment have led to dramatic increases in the quantity and complexity of produced data, which makes the development of automated analysis tools very important. A powerful approach to analyze dynamics contained in such data sets is to describe/approximate it by di ﬀ usion on a free energy landscape - free energy as a function of reaction coordinates (RC). For the description to be quantitatively accurate, RCs should be chosen in an optimal way. Recent theoretical results show that such an optimal RC exists; however, determining it for practical systems is a very di ﬃ cult unsolved problem. Here we describe a solution to this problem. We describe an adaptive nonparametric approach to accurately determine the optimal RC (the committor) for an equilibrium trajectory of a realistic system. In contrast to alternative approaches, which require a functional form with many parameters to approximate an RC and thus extensive expertise with the system, the suggested approach is nonparametric and can approximate any RC with high accuracy without system speci ﬁ c information. To avoid over ﬁ tting for a realistically sampled system, the approach performs RC optimization in an adaptive manner by focusing optimization on less optimized spatiotemporal regions of the RC. The power of the approach is illustrated on a long equilibrium atomistic folding simulation of HP35 protein. We have determined the optimal folding RC - the committor, which was con ﬁ rmed by passing a stringent committor validation test. It allowed us to determine a ﬁ rst quantitatively accurate protein folding free energy landscape. We have con ﬁ rmed the recent theoretical results that di ﬀ usion on such a free energy pro ﬁ le can be used to compute exactly the equilibrium ﬂ ux, the mean ﬁ rst passage times, and the mean transition path times between any two points on the pro ﬁ le. We have shown that the mean squared displacement along the optimal RC grows linear with time as for simple di ﬀ usion. The free energy pro ﬁ le allowed us to obtain a direct rigorous estimate of the pre-exponential factor for the folding dynamics.


INTRODUCTION
Due to advances in computer hardware and simulation methodology, it is becoming increasingly easier to generate large simulation data sets of complex molecular systems, with a prominent example being the long equilibrium trajectories of fast folding proteins. 1,2 Because of the complexity of dynamics and high-dimensionality of the resulting trajectories, the generation of many trajectories per se is not sufficient to provide full scientific insight. Eventually it becomes necessary to synthesize the data into as faithful as possible a picture of the process of interest. Given the growing size and complexity of simulations, analysis and interpretation of such data are widely recognized as fundamental bottlenecks in the application of atomistic simulations. 3− 6 A fundamental way to analyze a simulation is to determine the underlying free energy landscape, i.e., the free energy as a function of one or more reaction coordinates (RCs), collective variables (CV), or order parameters. 1,5,7−10 Generally, one is interested in finding free energy minima or metastable states, pathways, transitions states (TS), and free energy barriers. The major difficulty in such an analysis is the selection of appropriate RCs. A poorly chosen RC can result in a misleadingly simple free energy landscape with missing minima and the absence or underestimation of barriers. 3,7 Experience has shown that RCs chosen based on intuition or using common methods such as principal component analysis (PCA) are usually suboptimal. Hence a large number of methods have been suggested to determine good RCs or CVs in an where X denotes the position in multidimensional configuration space, U is the potential energy, D is the diffusion tensor, β = 1/(kT), k is the Boltzmann constant, and T is temperature. Given two boundary states A and B, the committor, q(X), is the solution of the adjoint equation 19,20 with boundary conditions q = 0 for X ∈ ∂A and q = 1 for X ∈ ∂B. The committor is thus a complex, high-dimensional function, which is the solution to the high-dimensional partial differential equation. It has been determined accurately only for a small number of low-dimensional model systems. 21 Determining the committor for a realistic complex system of interest is a very difficult unsolved problem. Moreover, in practice, one needs to determine the committor from a long equilibrium trajectory, rather than from U(X) and D(X). While a number of approaches have been suggested to determine the committor, 5 they all have serious drawbacks. In particular, putative RCs determined for realistic systems cannot pass proposed committor validation tests. 8,22 Here we present a solution to this important problem. We describe an approach that accurately determines the committor, so it can pass the validation tests, and illustrate its performance by analyzing a long equilibrium protein folding trajectory. To apply the committor validation test, coordinate r is first transformed to the committor as a function of this coordinate r → q(r) using the Markov state model formalism. 5,27 Then Z C,1 (q(r), Δt) profiles are computed for Δt/Δt 0 = 1, 4, 4 2 , ···, 4 8 . The closer Z C,1 (q(r), Δt) to N AB = 73 is, the more optimal coordinate r is. For the optimal coordinate or committor Z C,1 (q, Δt) = N AB . For suboptimal coordinates Z C,1 (q(r), Δt 0 ) > N AB or −ln Z C,1 (q(r), Δt 0 ) < −ln N AB . As Δt increases −ln Z C,1 (q(r), Δt) increases as well and reaches the limiting values of −ln N AB for large Δt. a and b) the C α root-mean-square deviation from the native structure. c and d) the fraction of native contacts. e and f) putative RC obtained via the nonparametric optimization after 200 iterations.

Journal of Chemical Theory and Computation
Article 2. METHOD 2.1. Nonparametric Variational Optimization of RCs. Variational approaches appear to be most promising for RC optimization. 5 A functional form (FF) with many parameters R(X, α i ) is suggested as an approximation to an RC. One numerically optimizes the parameters α i by optimizing a particular functional, for example, the probability of being on a transition path, 8,23 the likelihood functional, 17 the cut profiles, 9,24,25 or the total squared displacement. 22,26 Here we consider the last one. Given a long equilibrium multidimensional trajectory X(kΔt 0 ), where Δt 0 is the trajectory sampling interval, one computes the reaction coordinate time-series r(kΔt 0 ) = R(X(kΔt 0 ), α i ). Here and below r defines an arbitrary reaction coordinate, while q is reserved for the committor. Given two boundary states A and B, the optimal coordinate between them (the committor) is the one that provides a minimum to the total squared displacement Δr 2 = ∑ k [r(kΔt 0 + Δt 0 ) − r(kΔt 0 )] 2 , under the constraints that r(k ∈ A) = 0 and r(k ∈ B) = 1, i.e., the boundary states A and B map to 0 and 1, respectively. It is straightforward to prove this principle by assuming that the system dynamics is described by a Markov state model. The total squared displacement equals Δr 2 = N∑ ij P ji (Δt)P i (r j − r i ) 2 , where P ji (Δt) is the transition probability matrix from state i to state j after Δt, P i is the equilibrium probability, N is the total number of snapshots in the trajectory, and r i is the position of microstate i on the RC. Differentiating with respect to r k and assuming the detailed balance P ji (Δt)P i = P ij (Δt)P j , one obtains the following equation for committor q 5,22 Note that the assumption that systems dynamics is described by a Markov state model is used only for the derivation of equations. One does not need to perform the actual construction of such a model, which means that this assumption does not restrict the applicability of the algorithm. The theoretical minimum value of the functional, attained for r = q, equals Δq 2 = 2N AB , 22 where N AB is the total number of transitions from state A to B. Thus, if during RC optimization Δr 2 /2 reaches N AB , it follows that the putative RC equals the committor. During optimization of an RC for a finitely sampled system it is possible to obtain Δr 2 /2 < N AB , i.e., a value of the functional that is lower than the theoretical lower bound. In this Figure 2. Optimization of RC for HP35 double mutant. The free energy profile (a) and Z C,1 profiles (b) for RC r obtained with the nonparametric approach. While RC r is close to the committor it still deviates from it. The free energy profile (c) and Z C,1 profiles (d) for RC q obtained with the adaptive nonparametric approach. RC q closely approximates the committor. Z C,1 (r, Δt) are computed for Δt/Δt 0 = 1, 4, 4 2 ,···, 4 8 . Solid black lines on (b) and (d) show Z C,1 (r, Δt 0 ). b) Schematic representation of the multivalued character of the optimal RC. Its value increments by 2 every time one goes around the full circle. It is analogous to a multivalued angle, whose value increments by 2π, i.e., z ∼ ϕ/π.

Journal of Chemical Theory and Computation
Article case we say that the putative RC overf its the trajectory. Because of the usage of many fitting parameters the RC starts to approximate the statistical noise due to limited sampling rather than the actual dynamics.
A major weakness of these approaches is that it is difficult to suggest a good FF approximating the RC. The difficulty becomes apparent if one remembers that such an RC should be able to accurately project a few million snapshots of a very highdimensional trajectory. In particular, it implies an extensive knowledge of the system, and that such a FF is likely to be system specific.
Recently we have suggested a nonparametric approach, which bypasses the difficult problem of finding an appropriate FF. 12 Since Δr 2 depends explicitly only on the RC time-series r(kΔt 0 ), one may directly optimize the values of r(kΔt 0 ) rather than the parameters α i of the FF R(X, α i ).
However, r(kΔt 0 ) values cannot be varied independently of each other, 12 because points close in the original multidimensional space should have close projections (R(X) is a continuous function), i.e., if X(iΔt 0 ) ∼ X(jΔt 0 ), then r(iΔt 0 ) ∼ r(jΔt 0 ). To vary r(kΔt 0 ) in an appropriate, concerted way, one improves r(kΔt 0 ) in the following iterative manner: r′(kΔt 0 ) = f(r(kΔt 0 ), y(kΔt 0 )), where r′(kΔt 0 ) is the updated values of the RC time-series, y(kΔt 0 ) is the time-series of a randomly chosen coordinate of the original multidimensional space X, and f(x, y) is a low degree polynomial. If the system obeys some symmetry (e.g., the rotational and translational symmetries for biomolecules), then the RC should obey the same symmetry. A simple way to ensure this is to use as y(kΔt) collective variables that respect the symmetry. For analysis of protein dynamics, one can use distance time-series between randomly chosen pairs of atoms. Another possibility is to use time-series of sin or cos of a randomly selected dihedral angle. The flowchart of the algorithm is provided in Figure 7 in the Appendix.
The idea has been successfully tested on an extensively sampled 50 dimensional model system, with a trajectory of 10 6 steps containing 989 transitions (N AB ). 12 After 9100 iterations Δr 2 /2 reached 988.9. Continuation of the optimization for 100000 iterations in total insignificantly decreased Δr 2 /2 to 986.5, indicating that no notable overfitting is possible and that the putative RC should be very close to the committor, which was confirmed by applying the committor validation tests. 22,27 The optimization has improved the seed RC, even though the difference between the corresponding free energy profiles at the top of the TS was only 0.05 kT, i.e., the approach is very sensitive.
However, it is not always possible in practice to perform an extensive sampling of a system of interest. A typical example is the simulation of protein folding, where it is very computationally expensive to obtain a handful of folding−unfolding events. 1,2 In this case the direct application of the simple approach described above leads to Δr 2 /2 ≪ N AB , which is an indication of severe overfitting. Here we describe the approach, which allowed us to determine the optimal RC or committor for a typical realistic system of interest, namely the atomistic folding trajectory of a double mutant of HP35.
Briefly, the idea is as follows. As we show below, an RC is likely to be optimized in a nonuniform manner: it is easier to optimize TSs rather than free energy minima. Consequently, some parts of the RC may be optimized more than others. For an extensively sampled system, where overfitting is not possible, this does not present a problem. As some, more optimized parts of the RC reach their optimal values, they cannot be improved further, and the only way to decrease the functional is to optimize the suboptimal parts of the RC. For a system with limited sampling, where overfitting is possible, one needs to find a way to make the optimization uniform and stop it as soon as all the parts of the RC are optimal. Using the protein folding trajectory as an example, we first describe how one can detect the time scales and the regions of the putative RC which are most suboptimal. Then we describe how one can improve uniformity of optimization by focusing on these suboptimal regions and time scales.
2.2. Identification of Suboptimal Spatiotemporal Regions. Suboptimal spatiotemporal regions can be detected by using Z C,1 (r, Δt) cut-profiles, an important quantity for RC analysis, which can be straightforwardly computed from RC time-series r(kΔt 0 ) and whose properties we briefly summarize below (more details are provided in the Appendix). 5,22 The integral ∫ Z C,1 (r, Δt) dr equals Δr 2 (Δt)/2, hence Z C,1 (r, Δt) can be interpreted as a position dependent density of the Δr 2 (Δt)/ 2 functional. To optimize the entire RC one needs to minimize the average of Z C,1 (r, Δt), and to optimize the RC in a particular region one needs to minimize Z C,1 (r, Δt) in that region. For a suboptimal RC, Z C,1 (r, Δt) values generally decrease to the limiting value of N AB , as Δt increases. The larger the difference between Z C,1 (r, Δt 1 ) and Z C,1 (r, Δt 2 ) the less optimal the RC around r. For the optimal RC or the committor Z C,1 (q, Δt) = N AB , which can be used as a committor validation test. 27 Thus, our aim is to determine an RC time-series r(kΔt 0 ), such that −ln Z C,1 (r, Δt) ≈ − ln N AB up to statistical uncertainty, roughly estimated as N 1/ 2 AB . We consider a long equilibrium folding−unfolding trajectory of HP35 Nle/Nle double mutant consisting of 1509392 snapshots at 380 K. 28 The boundary states are defined using rather stringent criteria to ensure that only the configurations from the respective basins are obtained: node B in the native state is defined by the C α root-mean-square deviation (rmsd) from the native 2f4k pdb structure 29 smaller than 1.0 Å, and node A in the denatured state is defined by the C α rmsd greater than 10.5 Å (Figure 1a). The total number of transitions between these nodes, determined from the trajectory, is N AB = 73. Figure 1 shows the free energy profiles and Z C,1 profiles (the committor validation test) for two popular conventional RCs, the rmsd and the fraction of native contacts (Q), 30 and compares them with the putative RC obtained with the nonparametric approach starting from the coordinate initialized to zero. Z C,1 profiles show that the conventional RCs are far from being optimal and are worse than the putative RC already after 200 iterations. The FEP F(r) (Figure 1e) shows the main transition state (TS) barrier separating the denatured and native states and that the native state contains two basins.
After 33700 iterations Δr 2 /2 has reached the stopping value of N AB . Figures 2a-b show the FEP and the Z C,1 (r, Δt) profiles, respectively. As one can see Z C,1 shows relatively large variations, especially in the regions around the free energy minima, i.e., the difference between −ln Z C,1 (r, Δt) and −ln N AB is significantly larger than ≈ N 1/ 2 0.08 AB . It means that while the putative RC is close to the committor (cf. Figure 1), it still deviates from it. This is due to the following reasons. First, variability of Z C,1 (r, Δt 0 ) along r indicates that the RC is optimized in a nonuniform way. It is well optimized in the TS

Journal of Chemical Theory and Computation
Article region, where Z C,1 (r, Δt 0 ) is constant, and much less so around the minima.
Consider an analytical equation for the committor along a single coordinate: dq/dx ∼ D(x) exp[F(x)/kT]. Assuming that D(x) is relatively constant, the equation shows that regions with high free energy (barriers) get exponentially magnified compared to the regions with low free energy (minima). When one tries to update the RC using a low degree polynomial r′ = f(r, y), it is difficult to simultaneously update the entire RC on two very different scales. Increasing the polynomial degree might help; however, a more efficient solution is to update the segments with vastly different scales separately. For example one may use different low degree polynomials for different segments. Second, the fact that −ln Z C,1 (r, Δt 0 ) is higher than the other −ln Z C,1 (r, Δt) around the TS indicates that the latter are less optimized than the former, e.g., the RC is optimized in a temporally nonuniform way.
2.3. Improving Spatial Uniformity. To identify such less and more optimized spatiotemporal segments in an automatic way, i.e., without user intervention, we suggest the following procedure. During optimization the variability of Z C,1 (r, Δt) is monitored to determine the regions of RC which are less optimized. Namely, the larger the difference between −ln Z C,1 (r, Δt′) and −ln Z C,1 (r, Δt) for some Δt′ > Δt the less optimal is the coordinate in the region around r. One finds such Δt′ > Δt, for which the nonuniformity of the distance between the profiles are considered to be less optimized. The less and more optimized segments of the RC are updated using different polynomials of higher and lower degree, respectively. Here we used polynomials of fifth and second degree. A polynomial of lower, fourth, degree was not sufficiently flexible to improve suboptimal regions. One can also just update the less optimized segments while keeping the rest of the RC fixed. This simple procedure improves the spatial uniformity of optimization.
2.4. Improving Temporal Uniformity. Temporal uniformity can be improved by optimizing the RC with longer sampling intervals. However, one cannot simply optimize Δr 2 = ∑ k [r(kΔt + Δt) − r(kΔt)] 2 , with, e.g., Δt = 2Δt 0 . The optimal RC corresponding to Δt > Δt 0 differs from that corresponding to Δt = Δt 0 . 22 An intuitive way to understand the difference is to note that when a trajectory is observed with a longer sampling interval, one may miss the events when the system visits a boundary node and quickly comes back, thus underestimating the probability to end up at the boundary. More formally, if the RC r i satisfies eq 2a for Δt = Δt 0 , then it is straightforward to show that r i satisfies the same equation for Δt = kΔt 0 , where P ji (kΔt) = P k (Δt) ji . However, the equation is not satisfied by the boundary nodes, which satisfy eq 2b. In particular, eq 2a means that the average displacement from every point is zero, and for a boundary point it is not true: all points are either smaller or larger than it.
One way to overcome this problem is to eliminate boundaries without modifying the RC by joining two identical copies of the RC into a circle as shown in Figure 3a. 31,32 When at states A or B, the system can follow either of the RC copies with equal probability. Then the average displacement from points A and B is zero due to symmetry, i.e., eq 2a is valid for all points. The optimal RC (the solution of eq 1 or eq 2a) on a circle is a multivalued function. 31 For example, consider diffusion on a circle with constant U(ϕ) and D(ϕ), where ϕ is the angle. Eq 1 reads ∂ 2 q/∂ϕ 2 = 0, with the solution q = ϕ/π, meaning q is multivalued similar to ϕ: after making the full circle q is incremented by 2 (2 RCs of length 1).
In practice this multivalued RC on a circle, denoted by z (Figure 3b), is constructed from single valued RC r as follows. Consider a circle with a perimeter of 2 (radius 1/π) with coordinate z along the perimeter (Figure 3a). States A and B correspond to points with angle ϕ = π and ϕ = 0, correspondingly (Figure 3a). The points divide the circle into two segments: lower half γ 0 for ϕ ∈ [−π: 0], where z(r) = r and upper half γ 1 for ϕ ∈ [0: π], where z(r) = 2 − r. These points, correspondingly, divide the multivalued RC into different segments or branches z m (Figure 3b). For each segment z m one has the following correspondence: z m (r) = m + r for even m = 2l i.e., for segments on γ 0 , and z m (r) = m + 1 − r for odd m = 2l + 1 i.e., for segments on γ 1 . The segment number time-series m(kΔt 0 ) is determined from seed RC time-series r(kΔt 0 ) as follows. Whenever the trajectory visits a boundary node, it selects with equal probability which of the two adjoint segments it will follow. For example, when the trajectory, currently on segment z 0 , visits node B, it selects with equal probability z 0 or z 1 , if it visits node A, it selects between z 0 and z −1 . Once determined, m(kΔt 0 ) are kept fixed during optimization (the boundary states snapshots do note move), only r(kΔt 0 ) can change. Such constructed multivalued function z satisfies eq 2a for any value of Δt, and thus one can optimize Δz 2 (Δt) = ∑ k [z m(kΔt+Δt) (r(kΔt + Δt)) − z m(kΔt) (r(kΔt))] 2 for any value of Δt. Note that Δz 2 (Δt 0 ) = Δr 2 (Δt 0 ).
2.5. Adaptive Nonparametric Optimization. The ideas described above are combined into a simple optimization algorithm. Starting with RC initialized to zero, one iteratively improves the RC by nonparametrically minimizing Δz 2 (Δt), using polynomials of fifth and second degrees for less and more optimized segments of the RC, respectively. Every 5 iterations Z C,1 profiles are scanned to identify these segments. The sampling interval Δt is changed randomly every 50 iterations as Δt = 2 [10η] Δt 0 , where η is a uniformly distributed random number and [···] denotes an integer part. For Δt > Δt 0 optimization continues while Δz 2 (Δt)/2 > 1.15N AB , while for Δt 0 it continues while Δz 2 (Δt 0 )/2 > N AB . The flowchart of the algorithm is provided in Figure 8 in the Appendix. Figures 2c-d show the results obtained with the approach (cf. Figure 1). The variability of ln Z C,1 is uniformly decreased; it is roughly bounded by ±0.08, which means that we are close to the inherent statistical noise, and further improvement of the results makes little sense.

RESULTS
3.1. The FEP as a Function of the Optimal RC. Using the committor for the analysis and description of the folding dynamics may not be very convenient as the diffusion coefficient varies significantly along the coordinate D(q) = J AB /P eq (q) = Z H (q) −1 N AB /Δt, where P eq (q) is the equilibrium probability or Z H (q) is the corresponding histogram density computed from q(kΔt 0 ). 5,19,20,22 It is more convenient to use a "natural" coordinate, which we denote as q, where the diffusion coefficient is constant D(q) = 1 and that is related to the committor by the following monotonous transformation dq/dq = D(q) −1/2 . 24 Since the transformation is monotonous, the new coordinate is as good as the committor for the description of the dynamics. Figure 4 shows the free energy profile F(q) as a function of q. Note that D(q) = 1 in units where time is measured in timesteps of 0.2 ns.

Article
The free energy profile F(q) is relatively smooth in the denatured state (D) and the TS, while the native state (N) has many deep minima and high barriers. It is consistent with experimental observations that the native state has many conformational substates. 33 The substates differ structurally only locally. The high barriers are likely due to the compact structure of the native state: to perform any local conformational change, the protein needs to partially unfold first.
A single reaction coordinate does not show multiple pathways explicitly. However, if free energy basins that belong to different pathways do not overlap, they can serve as fingerprints to distinguish different pathways. For example, there is a clear separation between distributions of residence times on each transition path in the deep basin defined by |q− 0.863| < 0.0002 ( Figure 5). About 29 of a total of 146 pathways   a The numbers show the latter, while percentages in the brackets show the relative difference between the two. Times are given in ns.

Journal of Chemical Theory and Computation
Article belong to the second distribution with much longer mean residence times. It suggests that this basin is an intermediate state, that belongs to a minor pathway.
3.2. The Diffusive Model Reproduces Important Dynamical Quantities. The profile F(q) with the diffusion coefficient D(q) or F(q) and D(q) define a diffusive model of the dynamics projected on the committor. According to the theory this model should provide a rather accurate description of folding dynamics of the protein. 5,19,20,22 In particular, the following important dynamical quantities can be computed exactly: the equilibrium flux J AB = N AB /(NΔt 0 ) where N is the total number of trajectory snapshots, the mean first passage times (mfpt), and the mean transition path times (mtpt). Also, the mean squared displacement grows linearly with time as for simple diffusive motion.
The quantities were computed as 5 By selecting two points a and b (a < b) on the committor q, one can define two new boundary states: A′ contains all the points with q < a, and B′ contains all the points with b < q. The optimal RC for the new boundary states can be obtained by simple rescaling of the original RC: 5 q′ = (q − a)/(b − a) for a < q < b; q′ = 0 for q < a; and q′ = 1 for b < q. Hence, the equilibrium flux, the mfpt, and the mtpt can be computed exactly between any two such states. Table 1 compares these quantities computed from the diffusive model using eqs 3−5 and directly from the trajectory for different boundary states along the RC. We selected the original boundary states and free energy minima. The relative differences are around the expected statistical error for q of 8%. The differences can be reduced even further, if one removes non-negligible systematic bias due to the finite value of the trajectory sampling interval Δt 0 (see Appendix, Table 3). Figure 6 shows the mean squared displacement (MSD) as a function of time computed for different RCs. The MSD of the multivalued optimal RC (z) follows a simple diffusive law, i.e., it grows linearly with time ⟨Δz 2 ⟩ = 2N AB Δt/(NΔt 0 ). The MSD of q follows that of z for small Δt, when the system does not yet feel the boundaries, and approaches the limiting value for large Δt. The MSD of the other two popular suboptimal RCs, the fraction of native contacts and the C α rmsd from the native structure, shows subdiffusive behavior: ⟨Δr 2 ⟩ ∼ Δt α with α ∼ 0.45 and α ∼ 0.35, respectively. This illustrates that one of the reasons that the dynamics of various protein degrees of freedom is subdifusive is because these degrees are not optimal RCs. 5,22,34−36 3.3. The Pre-Exponential Factor. A fundamental problem in the analysis of protein folding dynamics, and an active area of research, is the determination of the free energy barrier ΔF and the pre-exponential factor k 0 , which are related to the folding rate as k f = k 0 e −ΔF/kT . Direct determination of these quantities from experiment has been hampered by very limited spatial and temporal resolution. The situation has significantly improved recently 37−40 e.g., one can now directly estimate the transition path times by counting single photons. However, the interpretation of the experiments still assumes a particular shape of the folding free energy landscape, which cannot be established in a direct manner. In this sense, the following quote from Yang and Gruebele 41 summarizes the experimental situation: "Without sufficient knowledge of the critical reaction coordinate for describing the motion represented by ν + [here k 0 ] it is impossible to relate experimentally determined folding rates rigorously to computed free energy barriers.".
Having determined the optimal RC q̃and the corresponding FEP F(q), which provide a quantitative description of the folding dynamics, we are now in a position to rigorously determine these quantities in a direct manner. We first note that to uniquely determine the height of the barrier, which is not invariant to monotonous transformations of RC (compare F(q) with F(q)), one needs to impose the constraint that the diffusion coefficient is constant. We estimate k 0 in three different ways. Taking the folding barrier of 4kT ( Figure 4) and k f −1 = τ f = 3054 ns (Table 1) one finds k 0 −1 = 55 ns. Applying the harmonic approximation to the Kramer's equation for the mfpt and assuming that the curvature at the denatured state and the unfolding basin are approximately equal, one can derive the following estimate k 0 −1 = 2πτ corr , where τ corr = kT/(Dω) 2 is the autocorrelation decay time at the TS. 42 The TS is approximated by a parabola with (ω 2 /2)/kT = 0.023, leading to k 0 −1 = 27 ns. Assuming diffusive dynamics over the parabolic TS (with the barrier height over 2kT), Szabo derived the following relation between the mtpt and k 0 where γ = 0.577 is Euler's constant. Taking points with q̃= 38 and q̃= 55 or q̃= 36.5 and q̃= 58.6 as boundaries for computing the mtpt (Table 1) one finds k 0 −1 = 18 ns or k 0 −1 = 24 ns, respectively. For boundaries at q̃= 17 and q̃= 68 or q̃= 1.7 and q̃= 83 one finds k 0 −1 = 186 ns or k 0 −1 = 888 ns, respectively. As one notes the estimate strongly depends on the choice of the boundaries.
We argue that the correct choice of boundaries is q̃= 38 and q̃= 55 or q̃= 36.5 and q̃= 58.6. Inside these boundaries the free energy profile is (approximately) parabolic, and thus the assumptions used to derive eq 6 are satisfied. While the boundaries are closer to the TS than to the minima of the denatured and native states, the mfpt between them captures 65% of the folding time. In other words, eq 6 is rather accurate, if applied to the parabolic part of the TS.
FEPs along suboptimal RCs are rather smooth with no apparent barriers in the native basin (see, e.g., Figures 1c and  e), which may lead one to the erroneous conclusions that an estimate based on eq 6 is valid for boundaries taken far from the TS, e.g., at the bottoms of the basins. The diffusive model cannot be used for a quantitative description of the dynamics projected on such an RC, because the projected dynamics is subdiffusive. For example, the difference between the mtpt computed from the diffusive model and from the trajectory for the number of native contacts RC (Figure 1c) is 1145%.

CONCLUDING DISCUSSION
We have presented an approach to determine the optimal RC or committor for realistic systems with limited sampling. The approach is nonparametric and can approximate any RC with

Journal of Chemical Theory and Computation
Article high accuracy. It can be readily applied to any system, avoiding prior analyses required to suggest a good system-specific functional form approximating the RC. In order to optimize the RC in a uniform manner we introduced adaptive optimization over different spatiotemporal regions, which required the introduction of a multivalued RC.
The approach was applied to the equilibrium folding trajectory of the HP35 double mutant. The determined RC closely approximates the committor as was validated by the optimality criterion − Z C,1 is constant up to the expected statistical noise. We have demonstrated that important dynamical properties -the equilibrium flux, the mean first passage times, and the mean transition path times between any two regions on the RC can be computed exactly, up to statistical uncertainty. The mean squared displacement of the optimal RC grows linearly with time as for simple diffusion. We emphasize that no fitting of the parameters of the diffusive models was employed and that an accurate description is achieved at the trajectory time scale of 0.2 ns. Using this RC we obtained a direct rigorous estimate for the pre-exponential factor of k 0 −1 ∼ 30 ns. To determine the optimal RC one needs to specify the boundary states A and B, which is often done by using order parameters, e.g., the root-mean-square deviation from the native structure here. However, such a definition may lead to poor results in more complex systems, where conventional order parameters may not be sensitive enough. One possibility to properly define boundary states in such systems is to use dominant eigenvectors, determined by the nonparametric approach. 12 While the process of nonparametric optimization of eigenvectors is not stable, its initial stable phase could be sufficient to define the boundary states.
The problem of finding an optimal RC for the description of complex dynamics is not unique to protein folding or molecular dynamics in general. Consider, for example, the problem of accurate description, monitoring and prognosis of disease dynamics. We assume that disease dynamics should be described stochastically, e.g., due to inherent randomness or coarse grained/incomplete description. In that case the best coordinate that describes the progress of the disease (the best biomarker) between two end states, e.g., healthy and abnormal, is the committor. 5,26 In particular, it should accurately predict the odds of positive outcome and the mean time to achieve that. The proposed approach can be used to construct such a coordinate from an ensemble of patient trajectories in an automated way without any disease specific information.

■ APPENDIX
Properties of Z C,1 (r, Δt) Given a long RC time-series r(kΔt 0 ), Z C,1 (r′, Δt) equals half the total length the trajectory makes, when it transits through a point r′ on the RC where ∑ k r ′ denotes the sum over such k when r′ is between r(Δt + kΔt) and r (kΔt). This quantity can be computed by considering every timestep Δt = Δt 0 of the time-series, every second timestep Δt = 2Δt 0 , third, and so forth.
If the RC satisfies eq 2a, then Z C,1 (r, Δt 0 ) = const. 22 If RC satisfies 2a, then it satisfies 2a with P ij (2Δt 0 ) = ∑ k P ik (Δt 0 ) P kj (Δt 0 ) and Z C,1 (r, 2Δt 0 ) = Z C,1 (r, Δt 0 ) = const, and so forth. 22 Boundary nodes satisfy eq 2b rather than eq 2a, and if transitions over r′ visit boundary nodes, then Z C,1 (r′, Δt > Δt 0 ) ≠ Z C,1 (r′, Δt 0 ). To overcome this problem at the boundaries, a special counting method using the ensemble of transition path segments has been suggested which restores driftlessness at boundaries and makes Z C,1 constant everywhere. 22 Alternatively one can combine two identical copies of the RC into a circle in order to eliminate boundaries, 31,32 as described in the main text ( Figure 3). In this case the RC is a multivalued function, denoted as z. Below, for brevity, Z C,1 (r′, Δt) denotes the Z C,1 profile as a function of the original RC r, while Z C,1 (z′) denotes a different Z C,1 profile as a function of the multivalued RC z. Analogously eq 7, given time-series z(kΔt 0 ), Z C,1 (z′, Δt), equals half the total length the trajectory makes, when it transits through a point z′ on the RC where ∑ k z ′ denotes the sum over such k when z′ is between z(Δt + kΔt) and z(kΔt). Z C,1 (r′, Δt) is obtained by summing up over all segments or branches z m (r′) of the multivalued function z, i.e., by projecting z back to r: Figure 8. Flowchart outlines the adaptive nonparametric RC optimization algorithm. The algorithm computes the putative RC time-series r(kΔt 0 ). * points that belong to the boundary states A and B are initialized to 0 and 1, respectively. They do not change during RC optimization. # for Δt = Δt 0 the condition is Δz 2 (Δt 0 )/2 > N AB .

Journal of Chemical Theory and Computation
For the optimal RC or the committor such computed Z C,1 (r′, Δt) = N AB is constant, i.e., is independent of r′ and Δt for Δt much less than the trajectory length.
In the limit of very small Δt, Z C,1 (r, Δt) = Z H (r)D(r)Δt, where Z H (r) ∼ e −F(r)/kT is the density of trajectory points around r, F is the free energy, and D is the diffusion coefficient (5). The relation can be used to determine the diffusion coefficient for arbitrary RC. For very large Δt, Z C,1 (r, Δt) = N AB . Since for the committor coordinate Z C,1 (q, Δt) is constant for all Δt and thus is equal to N AB , one can determine the diffusion coefficient along the committor as D(q) = Z H (q) −1 N AB /Δt. Integrating eq 7 one obtains ∫ Z C,1 (r, Δt)dr = ∑ k [r(kΔt + Δt) − r(kΔt)] 2 /2, hence Z C,1 (r, Δt) can be considered as the local average density of Δr 2 /2. Difference between two profiles Z C,1 (r, 2Δt) and Z C,1 (r, Δt), averaged over some local region, is proportional to Δr 2 (2Δt) − Δr 2 (Δt) ∼ ⟨(r(t + Δt) − r(t))(r(t) − r(t − Δt))⟩, the correlation between successive displacements. Which means that the closer Z C,1 (r, Δt) profiles for different Δt, the closer the correlation is to zero. For the optimal RC or the committor, Z C,1 (q, Δt) is constant for all Δt, and the correlation is zero. If Z C,1 (r, Δt) < Z C,1 (r, Δt 0 ) for Δt > Δt 0 , then the correlation is negative and the dynamics is subdiffusive. The larger the difference, the more negative are correlations and the more suboptimal is the coordinate. If, alternatively, Z C,1 (r, Δt) > Z C,1 (r, Δt 0 ), then the correlation is positive and the dynamics is superdiffusive.
Removing Systematic Bias Due to Finite Value of the Trajectory Sampling Interval Δt 0 Table 1 compares dynamical quantities determined using the diffusive model with that computed directly from the atomistic trajectory. Equations for the equilibrium flux, the mfpt, and the mtpt (eqs 3−5) were derived assuming that the dynamics is observed with infinitely high temporal resolution. In practice, however, the trajectory is saved with a finite sampling interval Δt 0 , which means that some of the events, when the system quickly visits a boundary state and comes back, can be missed. This may lead to systematic underestimation of the number of transitions N AB and overestimation of the mfpt and the mtpt. An accurate way, which removes this systematic bias, is to compare the estimates for the same value of Δt 0 . One may either compare eqs 3−5 with the results obtained directly from trajectory in the limit of Δt 0 → 0, or one may simulate diffusive dynamics on the free energy profile F(q) and determine the equilibrium flux, the mfpt, and the mtpt, when observed with the sampling interval of Δt 0 . We followed the second option. Diffusive dynamics on the free energy profile F(q) was simulated using MC with diffusion coefficient D(q) = 1 and a timestep of 0.001Δt 0 . The simulation length was chosen to be much longer than the original trajectory, so that statistical errors are negligible. Table 2 compares the dynamical quantities computed with sampling intervals Δt 0 and 0.001Δt 0 . One can see that, indeed, the systematic differences due to the finite sampling interval are non-negligible and comparable to the differences shown in Table 1. Table 3 compares the dynamical quantities computed by simulating diffusion on the free energy profile F(q) with that computed directly from the original atomistic trajectory, both with sampling interval Δt 0 . The differences are now smaller in comparison to the corresponding differences in Table 1.   a The numbers show the latter, while percentages in the brackets show the relative difference between the two. Times are given in ns.

Journal of Chemical Theory and Computation
Article