How Fast Fast-Folding Proteins Fold in Silico

In reported microcanonical molecular dynamics simulations, fast-folding proteins CLN025 and Trp-cage autonomously folded to experimentally determined native conformations. However, the folding times of these proteins derived from the simulations were more than 4–10 times longer than their experimental values. This article reports autonomous folding of CLN025 and Trp-cage in isobaric–isothermal molecular dynamics simulations with agreements within factors of 0.69–1.75 between simulated and experimental folding times at different temperatures. These results show that CLN025 and Trp-cage can now autonomously fold in silico as fast as in experiments and suggest that the accuracy of folding simulations for fast-folding proteins begins to overlap with the accuracy of folding experiments. This opens new prospects of developing computer algorithms that can predict both ensembles of conformations and their interconversion rates of a protein from its sequence for artificial intelligence on how and when a protein acts as a receiver, switch, and relay to facilitate various subcellular-to-tissue communications. Then the genetic information that encodes proteins can be better read in the context of intricate biological functions.


Introduction
How fast can fast-folding proteins autonomously fold in silico? This question is important because experimental folding times (τs) [1][2][3] are rigorous benchmarks that can be used to evaluate the accuracy of protein folding simulations. If accurate, such simulations offer not only insight into protein folding pathways and mechanisms [4][5][6][7] but also a means to determine ensembles of conformations and their interconversion rates of a protein, which are responsible for "proteins to act as receivers, switches, and relays and facilitate communication from the subcellular level through to the cell and tissue levels" [8]. Due to approximations in the empirical potential energy functions for the folding simulations, most simulated τs reported to date have been much longer than the corresponding experimental τs. For example, molecular dynamics (MD) simulations of fast-folding proteins using a distributed computing implementation with implicit solvation yielded τs that were consistent with the corresponding experimental values if Cα root mean square deviation (CαRMSD) cutoffs of 2.5-3.0 Å and 3.622 Å were used to identify conformations that constitute the native structural ensembles [9,10].
However, according to the reported sensitivities of the simulated τs to CαRMSD cutoffs [9,10], the τs would be considerably longer than the experimental values, if typical CαRMSD cutoffs of <2.0 Å were used. For another example, advanced microcanonical MD simulations predicted τs of fast-folding proteins CLN025 [11] and Trp-cage [12] to be 600 ns at 343 K and 14 µs at 335 K, respectively [13]. These τs are of high quality as the τs were derived from the microcanonical MD simulations that resulted in the most populated conformations of CLN025 and Trp-cage with CαRMSDs of 1.0 and 1.4 Å from the experimental native conformations, respectively [13].
However, because the experimental τs of the two proteins reportedly increase as temperature decreases [1,2], the simulated τs at 300 K are conceivably more than 4-10 times longer than the 4 experimental τs. Therefore, how fast fast-folding proteins fold in silico equates to how accurate protein folding simulations can be. Most reported τs to date suggest that fast-folding proteins cannot autonomously fold in silico as fast as in experiments. This implies an accuracy gap between simulation and experiment for protein folding speed that is determined by folding mechanism or pathways [14].
To narrow the accuracy gap, a new protein simulation method was accordingly developed.
This method uses uniformly scaled atomic masses to compress or expand MD simulation time for improving configurational sampling efficiency or temporal resolution [15][16][17]. Uniformly reducing all atomic masses of a simulation system by tenfold can compress the simulation time by a factor of 10 and hence improve the configurational sampling efficiency of the low-mass simulations at temperatures of ≤340 K [16]. As detailed in Refs. [16,17], this method facilitates protein folding simulations on personal computers (such as Apple Mac Pros) under isobaricisothermal conditions at which most experimental folding studies are performed. As explained in Ref. [16], the kinetics of the low-mass simulation system can be converted to the kinetics of the standard-mass simulation system by simply scaling the low-mass time with a factor of 10 .
Subsequently, this low-mass simulation method led to the development of a revised AMBER forcefield that has shown improvements in (i) autonomously folding fast-folding proteins, (ii) simulating genuine localized disorders of folded globular proteins, and (iii) refining comparative models of monomeric globular proteins [18][19][20]. Hereafter the combination of the revised AMBER forcefield with the low-mass simulation method is termed FF12MC [18]. states of the zebrafish to the unfolded and folded states of a protein inspired the use of the R survival package to predict τs from sequences as follows [16,18]: Perform (i) ≥20 distinct and independent MD simulations to autonomously fold a fast-folding protein sequence using FF12MC, which results in ≥20 sets of instantaneous protein conformations in time, (ii) a cluster analysis of all instantaneous conformations from the ≥20 sets to obtain the average conformation of the largest cluster as the predicted native conformation of the protein, and (iii) a survival analysis using the ≥20 sets of conformations in time and the predicted native conformation to determine the mean τ and its 95%CI. As exemplified in Refs. [16,18], one advantage of this survival analysis method is that the τ prediction does assume that the fast-folding protein must follow a two-state folding mechanism; another advantage is rigorous estimation of mean τ and 95%CI from ≥20 simulations that are relatively short so that a few of these simulations may not capture a folding event.
As demonstrated below, use of the methods and forcefield outlined above resulted in accurate prediction of τs for CLN025 and Trp-cage (TC10b) and offered an answer to the question of how fast fast-folding protein fold in silico. A total of 160 distinct, independent, unrestricted, unbiased, isobaric-isothermal, microsecond MD simulations with a total aggregated simulation time of 1,011.2 µs were used the prediction. Hereafter, all simulations for each protein at a specified temperature are 40 distinct, independent, unrestricted, unbiased, and isobaric-isothermal MD simulations with FF12MC using a fully extended backbone conformation as the initial conformation of the protein for the 40 simulations. Also all simulation times described hereafter have been converted to standard-mass simulation times. 6

Molecular dynamics simulations
A fast-folding protein in a fully extended backbone conformation was solvated with the TIP3P water [22] with surrounding counter ions and/or NaCls and then energy-minimized for 100 cycles of steepest-descent minimization followed by 900 cycles of conjugate-gradient minimization to remove close van der Waals contacts using SANDER of AMBER 11 (University of California, San Francisco). The resulting system was heated from 5 to a temperature of 280-300 K at a rate of 10 K/ps under constant temperature and constant volume, then equilibrated for 10 6 timesteps under constant temperature and constant pressure of 1 atm employing isotropic molecule-based scaling, and finally simulated in 40 distinct, independent, unrestricted, unbiased, and isobaric-isothermal MD simulations using PMEMD of AMBER 11 with a periodic boundary condition at 280-300 K and 1 atm. The fully extended backbone conformations (viz., anti-parallel β-strand conformations) were generated by MacPyMOL Version 1.5.0 (Schrödinger LLC, Portland, OR). The numbers of TIP3P waters and surrounding ions, initial solvation box size, and ionizable residues are provided in Table S1. The 40 unique seed numbers for initial velocities of Simulations 1-40 are listed in Table S2. All simulations used (i) a dielectric constant of 1.0, (ii) the Berendsen coupling algorithm [23], (iii) the Particle Mesh Ewald method to calculate electrostatic interactions of two atoms at a separation of >8 Å [24], (iv) Δt = 1.00 fs of the standard-mass time [18], (v) the SHAKE-bond-length constraints applied to all bonds involving hydrogen, (vi) a protocol to save the image closest to the middle of the "primary box" to the restart and trajectory files, (vii) a formatted restart file, (viii) the revised alkali and halide ions parameters [25], (ix) a cutoff of 8.0 Å for nonbonded interactions,

Folding time estimation
The τ of a fast-folding protein was estimated from the mean time-to-folding in 40 distinct, independent, unrestricted, unbiased, and isobaric-isothermal MD simulations using survival analysis methods [21] implemented in the R survival package Version 2.38-3 (http://cran.rproject.org/package=survival). A CαβRMSD cutoff of 0.98 Å was used to identify conformations that constitute the native structural ensemble. For each simulation with conformations saved at every 10 5 timesteps, the first time-instant at which CαβRMSD reached ≤0.98 Å was recorded as an individual folding time (Table S3). Using the Kaplan-Meier estimator [26,27]  folding of a second set of simulations that were identical to the first set except that the simulation temperature of the second set was changed.

Cluster analysis and data processing
The conformational cluster analyses of CLN025 and TC10b were performed using CPPTRAJ of AmberTools 16 with the average-linkage algorithm [28], epsilon of 2.0 Å, and root mean square coordinate deviation on all Cα and Cβ atoms (see Table S4). No energy minimization was performed on the average conformation of any cluster. The linear regression analysis was performed using the PRISM 5 program for Mac OS X, Version 5.0d (GraphPad Software, La Jolla, California). 8

Simulated folding times of β-protein CLN025 at different temperatures
To determine how fast β-protein CLN025 autonomously folds in silico, 40 3.16-µs MD simulations of CLN025 were performed at 300 K. A cluster analysis of these simulations revealed that the average conformation in the largest cluster adopted a β-hairpin conformation  (Table S3A). CαβRMSD was used here to determine the individual folding time because it is more stringent to measure structural similarity than CαRMSD due to inclusion of both main-chain and side-chain structural information in CαβRMSD. Plotting the natural logarithm of the nonnative state population of CLN025 versus time-to-folding revealed a linear relationship with r 2 of 0.97 (Fig.   2), which indicates that CLN025 follows a two-state folding mechanism. Using the 40 individual folding times as times-to-folding, a survival analysis predicted the τ of CLN025 to be 198 ns (95%CI = 146-270 ns; n = 40) at 300 K (Table 1). These results agree with the experimental finding that the folding of CLN025 followed a two-state folding mechanism with a τ of 137 ns at 300 K, where the τ was obtained from Fig. 6 of Ref. [1]. To substantiate the agreement between the experimental and computational τs at 300 K, the 40 CLN025 simulations were repeated at 293 K. Using the same CαβRMSD cutoff and the same predicted native conformation, a 9 survival analysis showed that CLN025 followed a two-state folding mechanism (r 2 = 0.94; Fig. 2) with a τ of 279 ns (95%CI = 204-380 ns; n = 40) at 293 K (Table 1). This outcome again agrees with the experimental τ of 261 ns at 293 K that was also obtained from Fig. 6 of Ref. [1].  (Table 1). These results are consistent with the experimentally determined two-state folding mechanism and a τ of 2.4 µs (95%CI = 1.6-3.2 µs) at 280 K for Trp-cage. The experimental τ was extrapolated from Fig. 4 of NMR lnk F in Ref. [2].

Simulated folding times of α-protein
The experimental 95%CI was calculated from the reported errors of ±0.18 for lnk F in the 12-14 range [2] according to the standard method for propagation of errors of precision [29]. Repeating the 40 simulations of TC10b at 300 K and the survival analysis using the same predicted native conformation and the CαβRMSD cutoff of 0.98 Å revealed a two-state folding mechanism (r 2 = 0.96 in Fig. 2) and a τ of 0.8 µs (95%CI = 0.6-1.0 µs; n = 40; Table 1), which is also consistent with the experimental τ of 1.4 µs (95%CI = 0.8-2.0 µs) at 300 K [2]. 10 The simulated τ may not be converged and useful if insufficient numbers of simulations are used to predict the native conformations and τ or if lax CαRMSDs are used to define the native structural ensemble. To assess the convergence of the simulated τs of CLN025 and Trp-cage described above, the average conformations of the largest cluster (viz., the simulated native conformations) for CLN025 and Trp-cage shown in Fig. 1 and the corresponding simulated τs listed in Table 1 were re-generated using 20 or 30 simulations from Simulations 1-20 or 1-30, respectively. According to the CαβRMSDs in Table S4 (Table S5). These results indicate that 40 simulations are sufficient.

Convergence of the simulated folding times
In the present study, a stringent CαβRMSD of ≤0.98 Å for the full sequence of CLN025 or Trp-cage was used to define the native structural ensemble, in contrast to the use of CαRMSD for a fast-folding protein with truncations on terminal residues. However, using an "overly" stringent CαβRMSD cutoff of 0.98 Å may lengthen the simulated τ, whereas using the average rather than the representative conformation of the largest cluster as the predicted native conformation may shorten the simulated τ. To address these concerns, all τs in Table 1 were reestimated from the same simulations using both the average and representative conformations with CαβRMSD cutoffs that varied from 0.98 Å to 1.40 Å. As apparent from Table S6, the τs of CLN025 and Trp-cage are insensitive to the change from the average to the representative conformation, and these τs are also insensitive to the variation of the CαβRMSD cutoff within 0.98-1.40 Å. In addition, all τs in Table 1 were determined from trajectories that revealed not only hazard functions of a two-state folding mechanism for both CLN025 and Trp-cage (Fig. 2) 11 but also their consistent folding events that are traceable in exemplary Videos S1 and S2. These theoretical mechanism and folding events are consistent with their experimentally determined folding mechanism and native conformations. These results indicate that the simulated τs in Table 1 are converged and therefore may be used to explain, confirm, or predict folding rates of CLN025 and Trp-cage.

Significance of the simulated folding times
The present study shows that agreements within factors of 0.69-1.75 between the experimental and simulated τs have been achieved for CLN025 and Trp-cage (Table 1). These agreements indicate that fast-folding proteins CLN025 and Trp-cage can now autonomously fold in simulations as fast as in experiments and provide an answer to the question of how fast fast-folding proteins fold in silico. These agreements also suggest that the accuracy of folding simulations for fast-folding proteins is beginning to overlap with the accuracy of folding experiments. This opens new prospects of combining simulation with experiment to develop computer algorithms that predict ensembles of conformations and their interconversion rates of a protein from its sequence. Such algorithms can improve the artificial intelligence on how and when "proteins act as receivers, switches, and relays and facilitate communication from the subcellular level through to the cell and tissue levels" [8]. Then the genetic information that encodes proteins can be better read in the context of intricate biological functions.

Conflict of interest
The author has not conflict of interest.   Table   S3A.